One Thousand Databases High (and rising)

Well, it's that time of year again. The 15th annual stamp collecting edition of the journal Nucleic Acids Research (NAR), also known as the 2008 Database issue [1], was published earlier this week. This year there are 1078 databases listed in the collection, 110 more than the previous one (see Figure 1). As we pass the one-thousand-database mark (1kDB), I wonder: what proportion of the data in these databases will never be used?

R.I.P. Biological Data?

It seems highly likely that lots of this data is stored in what Usama Fayyad at Yahoo! Research! Laboratories! calls data tombs [2], because as he puts it:

“Our ability to capture and store data has far outpaced our ability to process and utilise it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.”

Like last year, let's illustrate the growth with an obligatory graph (see Figure 1).

Figure 1: Data growth: the ability to capture and store biological data has far outpaced our ability to understand it. The vertical axis is the number of databases listed in Nucleic Acids Research [1]; the horizontal axis is the year. (Picture drawn with the Google Charts API, which is OK but, as Alf points out, doesn't do error bars yet.)
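For the curious, a chart like Figure 1 can be knocked together by hand-building a URL for the Google Charts API. Here is a minimal sketch in Python; only the 968 (2007) and 1078 (2008) counts come from the post itself, and the earlier values are illustrative placeholders rather than the real NAR figures.

    # Build a Google Image Charts URL for a simple line chart of database counts.
    # Earlier counts below are placeholders; only 968 (2007) and 1078 (2008)
    # are taken from the post.
    from urllib.parse import urlencode

    years = [2004, 2005, 2006, 2007, 2008]
    counts = [550, 720, 860, 968, 1078]

    params = {
        "cht": "lc",                                   # line chart
        "chs": "500x300",                              # image size in pixels
        "chd": "t:" + ",".join(str(c) for c in counts),
        "chds": "0,1200",                              # scale for the data values
        "chxt": "x,y",                                 # show x and y axes
        "chxl": "0:|" + "|".join(str(y) for y in years),
        "chxr": "1,0,1200",                            # y-axis label range
        "chtt": "Databases listed in the NAR Database issue",
    }
    print("http://chart.apis.google.com/chart?" + urlencode(params))

Fetching the printed URL returns the chart as a PNG, which is all Figure 1 needs (and, as noted, still no error bars).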

Another day, another dollar database

Does it matter that large quantities of this data will probably never be used? How could you find out how much, and which, data is "write-only"? Will biologists ever catch up with the physicists when it comes to Very Large stamp collections Databases? Biological databases are pretty big, but can you imagine handling up to 1,500 megabytes of data per second for ten years, as the physicists will soon be doing? You can already hear the (arrogant?) physicists taunting the biologists: "my database is bigger than yours". So there.
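How you would actually measure "write-only" data is an open question, but one crude sketch, for a single web-accessible database, would be to cross-reference the full list of record identifiers against the server's access logs. Everything below is hypothetical: the URL scheme, the file names and the assumption that a record never requested over HTTP was never read at all.

    # Crude sketch: which entries in a hypothetical database were never requested?
    # Assumes records live at URLs like /entry/<id>, an Apache-style access log,
    # and a file listing every record ID (one per line). All names are made up.
    import re

    ENTRY_URL = re.compile(r'"GET /entry/([A-Za-z0-9_.-]+)')

    with open("all_entry_ids.txt") as f:
        all_ids = {line.strip() for line in f if line.strip()}

    accessed = set()
    with open("access.log") as log:
        for line in log:
            m = ENTRY_URL.search(line)
            if m:
                accessed.add(m.group(1))

    never_read = all_ids - accessed
    print(f"{len(never_read)} of {len(all_ids)} entries "
          f"({100 * len(never_read) / len(all_ids):.1f}%) were never requested")

A proper study would have to account for bulk downloads, mirrors and programmatic access, of course, but even a rough number like this would say something about the size of the tomb.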

Whichever of these databases you are using, happy data mining in 2008. If you are lucky, the data tombs you are working in will contain hidden treasure that will make you famous and/or rich. Maybe. As any stamp collector will tell you, some stamps can become very valuable. There's gold in them there hills databases, you know...

  1. Galperin, M. Y. (2007). The molecular biology database collection: 2008 update. Nucleic Acids Research, Vol. 36, Database issue, pages D2-D4. DOI:10.1093/nar/gkm1037
  2. Fayyad, U. and Uthurusamy, R. (2002). Evolving data into mining solutions for insights. Communications of the ACM, 45(8):28-31. DOI:10.1145/545151.545174
  3. Stamp collectors picture, top right, thanks to daxiang stef / stef yau

Comments


It doesn't matter, they got published

Does it matter that large quantities of this data will probably never be used?

Probably not, as their authors are already happy enough to have them published; then they can stop maintaining them and move on to another project/publication. It is indeed a sad state of affairs.


Why NAR

I'll add the usual defence of the NAR databases: at least they are indexed by PubMed, so biologists who aren't in the habit of checking Google for websites about their subject will find them quicker via NAR. The data is not dead, either; websites can always be exported. (In my experience, for smaller databases it's usually quicker to scrape the data from an HTML page with something like HTTP::Recorder than to write to the person responsible for the data and ask for an SQL dump.)

In addition, people get papers for their databases this way, and other people can cite the database properly, so NAR makes the web citable and advances someone's career a little bit. It already eases the transition from a traditional paper-based science to a more web-oriented world. The databases are peer-reviewed, so they don't contain complete crap; a paper in NAR assures some minimal quality. And a write-only database is better than none at all: at least someone has collected something, and you can scrape it from its tomb.

I would just love to see a minimal requirement for a publication in NAR: They should all offer some simple text-based export, e.g. tab-delimited flatfiles. That could save me a lot of time...
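For what it's worth, here is a rough sketch of the kind of scrape-and-export step described above, done in Python with the standard library rather than Perl's HTTP::Recorder. The URL and the assumption that the data sits in a single HTML table are hypothetical stand-ins for whatever database you are actually trying to liberate.

    # Scrape the cells of an HTML table and dump them as a tab-delimited flat file.
    # The URL below is a hypothetical placeholder.
    import csv
    import urllib.request
    from html.parser import HTMLParser

    class TableParser(HTMLParser):
        """Collect the text of every <td>/<th> cell, row by row."""
        def __init__(self):
            super().__init__()
            self.rows, self._row, self._cell, self._in_cell = [], [], [], False

        def handle_starttag(self, tag, attrs):
            if tag in ("td", "th"):
                self._in_cell, self._cell = True, []
            elif tag == "tr":
                self._row = []

        def handle_endtag(self, tag):
            if tag in ("td", "th"):
                self._in_cell = False
                self._row.append("".join(self._cell).strip())
            elif tag == "tr" and self._row:
                self.rows.append(self._row)

        def handle_data(self, data):
            if self._in_cell:
                self._cell.append(data)

    url = "http://example.org/some-nar-database/browse"   # hypothetical
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    parser = TableParser()
    parser.feed(html)

    # Write out the tab-delimited flat file the database itself never offered.
    with open("export.tsv", "w", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(parser.rows)

It works, but it shouldn't be necessary; a simple export link on the database website would make the whole exercise redundant.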

Of course, large quantities of these data are never used and never read. But, heck, this is research, right? 90% will not be used in the end. It's not too different from those 500 alignment algorithms, 200 genomic analyses, hundreds of papers that describe "new" cloning strategies, yet another new gene, etc. I don't believe that the physicists use 100% of their 1.5 GB/sec either.


Dude, Where's My NAR?

Hello Maximillian, I wasn't attacking NAR, I was just wondering what proportion of the data is dead. It is an interesting technical challenge to find this out. It would also be useful to measure the cost of gathering noisy, redundant and poorly understood data in terms of wasted resources (people, time, money, computers, false positives etc.). Perhaps somebody has done something like this already? Especially with all the irresponsible sequencing just for the sake of it that goes on...

As for the physicists, I mentioned them for comparison. Like you, I doubt they will use 100% of their data either, but they will probably use much more of it. However, they keep delaying switching their big machine on, which means they are still waiting for the data. That is, unless when they finally flip the switch on the LHC, we all disappear into a black hole, tombs and all :)


Also, think of all the

Also, think of all the redundant effort that goes into setting up the basic technical infrastructure each database needs (data storage, user interfaces etc)...


Swiss Prot Databases / Action Figures

Hi Eric, yeah, even more redundancy there. Talking of Swiss-Prot people, I'm just wondering when your Amos Bairoch action figure will be available in shops?! What are you up to now that you've moved on from Swiss-Prot to Seattle?


Yeah, need to update my

Yeah, need to update my blog...


700?

At 700 a pop for an Action Figure?


