Reproducible research with the publication of a "compendium"

Via Faculty1000 (subscription only) I found this paper, from the people behind the Bioconductor project, on a new way to publish results as a document in which code and words are woven together to create a "compendium". The document can then be browsed and the code changed interactively. The authors give a concrete example using R packages and data from Golub et al. on cancer classification by expression data analysis.

I have never used R, so I'm still messing around trying to install everything and try this properly, but at first glance it looks like an interesting concept. With the publication of results you would immediately get the methods as well, and you could change parameters to check a hypothesis, and so on. It would certainly help referees to check some ideas quickly. Something like this could also be used in-house as a personal e-lab book to keep track of code, data and ideas.
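To make the idea concrete for myself, here is a minimal sketch in Python (not R, and not the tooling from the paper) of what "weaving" might look like. Everything in it is an assumption for illustration: the `<<code>>=` ... `@` chunk markers are only loosely modelled on noweb-style literate documents, the `weave` function and the `cutoff` parameter are hypothetical names, and the numbers are made up rather than taken from the Golub et al. data.

```python
import contextlib
import io
import re

# Hypothetical chunk syntax: a line "<<code>>=" opens a code chunk
# and a line containing only "@" closes it.
CHUNK = re.compile(r"^<<code>>=\n(.*?)^@\n", re.DOTALL | re.MULTILINE)


def weave(source, params=None):
    """Run every code chunk in `source` and replace it with what it printed."""
    # One shared namespace, so later chunks can see earlier results
    # and the reader-supplied parameters.
    env = dict(params or {})

    def run(match):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), env)
        return buffer.getvalue()

    return CHUNK.sub(run, source)


# Made-up illustrative values, not real expression data.
DOCUMENT = """\
Mean expression above the cutoff:
<<code>>=
values = [2.0, 4.0, 6.0, 8.0]
kept = [v for v in values if v > cutoff]
print(sum(kept) / len(kept))
@
"""

# The same document, rebuilt with two different parameter choices.
print(weave(DOCUMENT, {"cutoff": 3.0}))
print(weave(DOCUMENT, {"cutoff": 5.0}))
```

The point of the sketch is only that the text and the computation live in one file, so a reader or referee can change `cutoff` and rebuild the document instead of taking the reported number on trust.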


Comments

I just noticed this same

I just noticed this same article on Faculty1000, with a rating of 9.0 (Exceptional). I find that odd given that there is prior art. See The Next Big Thing: From Hypermedia to Datuments.


Credit...

...where it's due. I read the Gentleman and Lang paper at least two years ago when I started using Sweave heavily.


Without requiring R

I do really like this idea, but I imagine that in a lot of fields (especially biology) there aren't many people using R. It would be a good start just to ask people to include their Prism files, Excel spreadsheets, FlowJo documents, or data files from whatever program they use for analysis alongside a paper, so that anyone who really wants to can easily double-check the results.


Good point, bad implementation

"I imagine that in a lot of fields (especially biology) there aren't many people using R"

You'd be right. Most biologists that I know have not even heard of R and that, for me, is where this concept rather falls over.

I think it is an interesting idea and I think they are trying to make a point - namely, that access to primary data and ability to easily reproduce results are often lacking in the publication process. However, the next question to address is how do we make the "average" biologist understand that this is important and implement systems that they can use easily to make things better?

They then fall into two traps: the "I like R, I use it for everything, so could everyone else" trap and the "take on the established wisdom with a crazy idea" trap. Frankly, you are not going to (a) persuade the majority of people to learn a lot of programming syntax just to write papers or (b) convince them that there are big problems with the current system of publication. A scientific paper is a compressed and abbreviated document for many reasons (space limitations, speed of submission and so on), but many people, myself included, rather like it this way. If you've written a paper, you've doubtless realised that condensing your results and discussion down to the absolute key points is in fact an important mental process for the writer: it helps you to summarise and explain what it's all about, in your own mind. The addition of cluttered R syntax runs almost counter to this mental process.

So, how to get authors to provide raw data and analyses? Keep it simple - my suggestion, group web servers. Easy and fun to set up, a public place where you can just drop whatever you like (files, code in ViewCVS), an educational process for biologists and likely to lead to other useful developments in their labs (group intranet, blogging and so on).


I think the raw data is

I think the raw data is already being taken care of, with most journals requiring submission to public repositories.

I agree that you would never get people to learn a complex syntax to write reports, but that's really not the utility of such a system (at least for me). If you're already writing code, particularly in something like R, which tends to be used for (exploratory) analysis, generating a dynamic document to accompany your analysis can be very useful. This is particularly true for interactive, session-based languages (R and Python spring to mind), where things are done on the fly.

Although it won't help with summarising and explaining work, this approach just might get rid of the "read the source code" mentality.


Sweave rocks

I've been using Sweave for a couple of years to generate dynamic reports. It's a useful concept once you get your head round it, as it allows you to document a piece of code on the fly, and write a report round the results at the same time. If, like me, you tend to go off on wild tangents at the slightest provocation, concurrently writing a report is a good way to keep on track.

Something that's not quite obvious is how to integrate multiple languages. You could do this by foreign calls from R (to e.g. Perl, Python, shell, SQL, etc.; the bindings are quite good), but that's a bit of a smelly hack. The whole point is to shift the primary focus to the document, rather than use it as a placeholder for a complex R session.

My other concern is that recalculating a large analysis isn't really feasible, especially in R. All in all, though, I think it's a rather nifty system.

P.S. You should check out Emacs Speaks Statistics (ESS) if you're going to use R -- the added value is substantial.


Nice

That is a really cool and, I believe, useful idea. Almost makes me want to try it for my next project. Or at least learn R!