Is science document-centric?

Joe Gregorio has a rant on the notion of 'document centric', attempting to answer the question: why do people put data in spreadsheets that should be in relational databases? What struck me after reading it was the obvious parallel in biology and bioinformatics. How many times have you seen biologists (for want of a better term) carefully collecting data in a spreadsheet that rightly belongs in a relational database? Is it that we are trained to think in terms of documents (papers)? Is this why online scientific publishing is so difficult for most traditional academics to grasp? Read on for bonus points...

Oh, all right then, maybe you were hoping for something more topical? Try a recent show; afterwards, just remember "Thank you Science!". Still not good enough? Well, he has returned (no, not him, *him*), so go add your voice to the chorus. Then go read everything Pedro has posted over the last few months, especially the posts discussing journal pre-print policies. Think software development for bioinformatics is incidental? Get with the program and go read Moses M. Hohman's paper on agile software development for bioinformatics. It seems that PLOS has the most advanced CMS: it can be sparqled (don't worry, only semantic web people care). Lastly, go poke around PubMed; they've gone all Web 2.0 when we weren't looking. Anything I missed?


Comments


Two issues

There are really two issues in the Gregorio article. One is that many people (biologists) use spreadsheets when they should be using databases. I don't know if that's because they are document-centric. I suspect it's more because they don't know how to use databases, or because in their minds "rows + columns = spreadsheet". Choosing the right software tool for the job requires more than the basic computer literacy (Office) that many people have. It is frustrating though - I've seen so many examples of data at my current workplace that should be in a database, perhaps with a nice web frontend, but are languishing in spreadsheets and wiki pages. It's just an education issue.
The second point is about database design. I must admit to being guilty of the "one table" database myself on occasion (though none of mine are quite as bad as the examples he cites!). It's partly due to my lack of facility with complex SQL queries and partly because when you are given data, it's often in a delimited text format (most probably exported from a spreadsheet...) and the temptation to do a simple 'mysqlimport' is just too great. Most often it's quick and it works - yes, your queries might be 0.1 s faster with a better table structure, but is it worth spending a day on table schemas? As ever, "it depends".
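To make the "one table" trade-off concrete, here is a minimal sketch of splitting a flat import into two related tables. It uses Python's built-in sqlite3 module rather than MySQL so that it runs anywhere, and every table name, column name and data value in it is hypothetical, made up for illustration only.

```python
import sqlite3

# Hypothetical flat rows, as they might arrive from a spreadsheet export.
# The expression values are made up for illustration.
flat_rows = [
    ("BRCA1", "Homo sapiens", 9606, 7.2),
    ("TP53", "Homo sapiens", 9606, 5.1),
    ("Trp53", "Mus musculus", 10090, 4.8),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The "one table" database: species details repeated on every row.
CREATE TABLE flat (gene TEXT, species TEXT, taxon_id INTEGER, expression REAL);

-- A normalized alternative: species stored once, referenced by key.
CREATE TABLE species (taxon_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE measurement (
    id INTEGER PRIMARY KEY,
    gene TEXT NOT NULL,
    taxon_id INTEGER NOT NULL REFERENCES species(taxon_id),
    expression REAL
);
""")
conn.executemany("INSERT INTO flat VALUES (?, ?, ?, ?)", flat_rows)

# Move the repeated species information into its own table...
conn.execute("INSERT INTO species SELECT DISTINCT taxon_id, species FROM flat")
# ...and keep only the key in the measurement table.
conn.execute(
    "INSERT INTO measurement (gene, taxon_id, expression) "
    "SELECT gene, taxon_id, expression FROM flat"
)

species_count = conn.execute("SELECT COUNT(*) FROM species").fetchone()[0]
joined = conn.execute(
    "SELECT m.gene, s.name FROM measurement m "
    "JOIN species s ON s.taxon_id = m.taxon_id ORDER BY m.gene"
).fetchall()
```

The quick import into `flat` still happens first; the split is two extra statements afterwards, which may or may not be worth it for a throwaway dataset.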

The Agile paper gladdens me, but not due to the content - more because it makes me think that we can publish almost anything. BMC Bioinformatics is not a bad journal, so I wonder how this got in. There's no bioinformatics in it whatsoever, as far as I can see. I'm not saying that it isn't a useful study, but it doesn't seem appropriate for the journal to me.


Dry your eyes...

Interesting what you can whine about. If people put data in documents in a form that can be normalized, I don't see any need to complain. However, I have seen many Excel files out there that can't even be sorted because rows have no identifiers, or because important information is kept solely by color (the horrors).
All-time favourite: comments that are continued in the next line. Data normalization should become a primer for every student in the life sciences, just to get the very basics right.
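As a rough sketch of what cleaning up such a sheet can look like, the Python snippet below folds "continued" comment lines back into the row above and gives every row an identifier. The column names, the sample data and the rule for spotting a continuation line (empty `sample` and `value` cells) are all assumptions made for illustration; information stored only as cell color is lost in a text export and cannot be recovered this way.

```python
import csv
import io

# Hypothetical export: the third line continues the comment of the row above.
raw = """sample,value,comment
A,1.5,looks fine
,,but re-run next week
B,2.0,contaminated
"""

def tidy(text):
    """Merge continuation lines into the previous row and add a row_id."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        # Assumed heuristic: a line with no sample and no value is a
        # continuation of the previous row's comment.
        if not rec["sample"] and not rec["value"] and rows:
            rows[-1]["comment"] += " " + rec["comment"]
            continue
        rows.append({"row_id": len(rows) + 1, **rec})
    return rows

rows = tidy(raw)
```

With identifiers in place and one record per line, the sheet can at least be sorted and imported without losing track of which comment belongs to which sample.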