Well it's that time of year again. The 15th annual stamp collecting edition of the journal Nucleic Acids Research (NAR), also known as the 2008 Database issue, was published earlier this week. This year there are 1078 databases listed in the collection, 110 more than the previous one (see Figure 1). As we pass the one thousand databases mark (1kDB) I wonder, what proportion of the data in these databases will never be used?
R.I.P. Biological Data?
It seems highly likely that lots of this data is stored in what Usama Fayyad at Yahoo! Research! Laboratories! calls data tombs, because as he puts it:
“Our ability to capture and store data has far outpaced our ability to process and utilise it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again.”
Like last year, let's illustrate the growth with an obligatory graph (see Figure 1).
Figure 1: Data growth: the ability to capture and store biological data has far outpaced our ability to understand it. The vertical axis is the number of databases listed in Nucleic Acids Research; the horizontal axis is the year. (Picture drawn with the Google Charts API, which is OK but, as Alf points out, doesn't do error bars yet.)
Another day, another
Does it matter that large quantities of this data will probably never be used? How could you find out how much of the data, and which data, was "write-only"? Will the Biologists ever catch up with the Physicists when it comes to Very Large Databases (stamp collections)? Biological databases are pretty big, but can you imagine handling up to 1,500 megabytes of data per second for ten years, as the Physicists will soon be doing? You can already hear the (arrogant?) Physicists taunting the Biologists, "my database is bigger than yours". So there.
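To get a feel for what 1,500 megabytes per second for ten years adds up to, here is a back-of-the-envelope sketch. It assumes the data flows continuously at the peak rate for the whole decade, that a megabyte is 10^6 bytes, and that a year is 365.25 days. None of these assumptions come from the Physicists themselves, and real experiments do not run non-stop, so treat the result as a rough upper bound.

```python
# Rough upper bound on total data volume at a constant recording rate.
# Assumptions (not from the original post): 1 MB = 10**6 bytes,
# 1 year = 365.25 days, and recording never stops.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def total_bytes(rate_mb_per_s: float, years: float) -> float:
    """Return the total number of bytes recorded at a constant rate."""
    return rate_mb_per_s * 1e6 * years * SECONDS_PER_YEAR


total = total_bytes(1500, 10)
print(f"{total / 1e15:.0f} petabytes")  # prints "473 petabytes"
```

So, under these assumptions, the Physicists would be looking at something in the region of a few hundred petabytes.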
Whichever of these databases you are using, happy data mining in 2008. If you are lucky, the data tombs you are working in will contain hidden treasure that will make you famous and/or rich. Maybe. As any stamp collector will tell you, some stamps can become very valuable. There's Gold in them there hills (databases), you know...
- Galperin, M. Y. (2007). The molecular biology database collection: 2008 update. Nucleic Acids Research, Vol. 36, Database issue, pages D2-D4. DOI:10.1093/nar/gkm1037
- Fayyad, U. and Uthurusamy, R. (2002). Evolving data into mining solutions for insights. Communications of the ACM, 45(8):28-31. DOI:10.1145/545151.545174
- Stamp collectors picture, top right, thanks to daxiang stef / stef yau