Not waving but drowning?
The 14th annual Nucleic Acids Research (NAR) database issue, for 2007, has just been published, open access. This year's issue is the largest yet (again), with 968 molecular biology databases listed, 110 more than the previous one (see figure below). In the world of biological databases, are we waving or drowning?
Nine hundred and sixty-eight is a lot of databases, and even that mind-boggling number is not an exhaustive tally. But is counting all these databases waving or drowning [1]? Will we ever stop stamp-collecting the databases and tools we have in molecular biology? What prompted this question is that an employee of The Boeing Company once told me they had given up counting their databases because there were just too many. Just think of all the databases of design and technical documentation that accompany the myriad different aircraft that Boeing manufacture, like the iconic 747 jumbo jet. Now combine that with all the supply chain, customer and employee information and you can begin to imagine the data deluge that a large multinational corporation has to handle.
Like Boeing, in biology we've clearly got more data than we know what to do with [2,3]. It won't be news to bioinformaticians, and it's been said many times before, but it's worth repeating here:
- We know how many databases we have, but we don't know what a lot of the data in these databases means: think of all those mystery proteins of unknown function. It will obviously take time until we understand it all...
- Most of the data only begins to make sense when it is integrated or mashed up with other data. However, we still don't know how to integrate all these databases, or as Lincoln Stein puts it, “so far their integration has proved problematic” [4], which is a bit of an understatement. Many grandiose schemes for the “integration” of biological databases have been proposed over the years, but unfortunately none have been practical to the point of implementation [5].
Despite this, it is still useful to know how many molecular biology databases there are. At least we know how many databases we are drowning in. Thankfully, unlike at Boeing, most biological data, algorithms and tools are open source, and more literature is becoming open access, which will hopefully make progress more rapid. But biology is more complicated than a Boeing 747, so we've got a long-haul flight ahead of us. OK, I've managed to completely overstretch that aerospace analogy now, so I'll stop there.
Whatever databases you'll be using in 2007, have a Happy New Year mining, exploring and understanding the data they contain, not drowning in it.
References
1. Stevie Smith (1957) Not waving but drowning
2. Michael Galperin (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1008
3. Alex Bateman (2007) Editorial: What makes a good database? Nucleic Acids Research, Vol. 35, Database issue. DOI:10.1093/nar/gkl1051
4. Lincoln Stein (2003) Biological Database Integration. Nature Reviews Genetics, 4 (5), 337-45. DOI:10.1038/nrg1065
5. Michael Ashburner (2006) Keynote at the Pacific Symposium on Biocomputing (PSB2006) in Hawaii. See also: Aloha: Biocomputing in Hawaii

This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
- Duncan's blog



Comments
Quantity not the problem
Not drowning. More data are good - the more, the better. Only a few of those databases are relevant to an individual researcher.
As others have already commented, the problems are (1) the quality of the databases, (2) their diverse, "higgledy-piggledy" nature (no standards, APIs, integration) and (3) their longevity, or lack thereof. Frankly, anyone with a modicum of SQL and CGI knowledge can put a database or web app online. So they do. You can't legislate against bad web resources.
I would question whether these annual issues still serve any useful purpose, other than to make the journal appear authoritative or provide an avenue for an easy publication. If I'm looking for an online resource I start with Google, not an outdated journal article.
Need standards
Great article, Duncan, thanks for bringing this up.
I have seen a lot of people accessing low-quality data from many well-known databases and high-quality data in not-so-well-known ones.
For example, GBK files mostly do not say anything about quality, while their ASN.1 counterparts might [ http://getentry.ddbj.nig.ac.jp/cgi-bin/get_entry.pl?AF207953 ].
As for the algorithms used to analyze these data, any comment would come across as trolling.
To minimize this, I feel something like a bioinformatics-oriented Digg would be great.
Vaporware
Each time a new annual issue of NAR is published I remember this paper from Nature.
http://www.nature.com/nature/journal/v435/n7045/full/4351010a.html
Databases in peril
Zeeya Merali and Jim Giles
Nature 435, 1010-1011 (23 June 2005)
doi: 10.1038/4351010a
Nature contacted 89 databases listed in the Molecular Biology Database Collection (Nucl. Acids Res. 28, 1-7; 2000) to see how many still have funding five years on. Of these, 51 reported that they are struggling financially. Seven of these have closed; the rest are being updated sporadically in their owners' spare time.
Pierre
Databases in Peril
Thanks for all your comments. Here are some thoughts...
Quantity is a significant problem. We're not just talking about individual databases like GenBank getting bigger and bigger; we're talking about more and more different types of databases. Potentially we want to allow the combination of data from any of these different databases and others that will appear in the future. Obviously, any given researcher probably isn't going to want to search all 900+ databases, but it would benefit the wider scientific community if all these databases could easily interoperate. The more databases there are, the more challenging easy interoperation becomes, because there is more heterogeneity: more APIs, more schemas, etc.
Peer-reviewed publication can help assess quality. This is what peer review is for. The editors of this issue claim to look for good-quality data as well as a good-quality interface. As pointed out in the comments above, “anyone with a modicum of knowledge can put a database or web app online”. By itself, this is not enough for publication. It is no good having great data with an awkward non-standard interface, or vice versa. The NAR database issue may well be an “easy” publication, but that doesn't make it any less important. The Databases in Peril article wouldn't have been possible if NAR hadn't been faithfully recording all this information in the first place. I suspect publication in the NAR database issue is harder than some have suggested. It's not just a case of shoving a database on the web and then writing a paper about it; you have to convince the reviewers the database is worthy: novel, useful and usable.
Churn is inevitable, but the overall trend is still upward. Databases (and tools) are not immortal; some are bound to wither and die eventually. Since last year 11 databases have gone this way, and the article discusses why. The general trend is still upward and will probably keep going. In the long run, the longevity of a database can be an indicator of its quality, because it shows that somebody cares and is skilled enough to maintain and fund it for a long period of time. As for the databases that are “struggling financially” (according to Nature), how is this news? Struggling could mean anything. Haven't you always had to fight for sustained funding of any scientific project?
Standards are boring (but important). It can be difficult to get standards work funded, done and published; this is what John Quackenbush calls blue-collar science. It is unglamorous but essential work, and nobody is going to win a Nobel Prize for creating a standard schema, ontology or whatever. What is the research contribution of creating a standard? Novelty? Discovery of new knowledge? This is partly why we have chaos: creating standards is, in itself, often not considered “science” or “research”. But without them, science is much harder.
Integrated search is hard. We would all like integrated search “from one box”, but the way to do this is still very much an open research question, not just in bioinformatics but in computer science too. What is more, this is not merely an “IT problem”; there are novel and serious scientific challenges in achieving it. If it were easy and straightforward to provide integrated search across all these databases, don't you think somebody would have done it by now? Until that time, we have Entrez Global Query...
Drowning!!!
I think that we're going to drown at this rate. Not because there are too many databases. Those can, and perhaps should, be spread far and wide. My concerns:
- Quality. How do we know whether the results in our hands are any good? Can we glean meaningful knowledge from them?
- Integrated search. I don't want to go to every database and search there. I want to search from one box.
- Standards. I want the data to follow certain minimum standards.
What was that about airlines :-)?
My Blog: http://mndoci.com