Via the Semantic Web life sciences list: a Nature Biotechnology Perspective discussing the importance of Semantic Web technologies and their impact on 'omic standards is available online (I was able to access it from home, but a subscription may be required, which is really starting to annoy me). I just don't understand how anyone is supposed to discuss science online if all the information is locked up in walled gardens.
Anyway, the article is good; it is long, but worth the read if you care about how you spend your time doing bioinformatics or if you want to see what the future has in store for biological standards development. So now that you've taken the trouble to read the article and attempted to digest its significance, you're probably thinking "So what ? It has no real practical benefit for me now."
It is true that articles like this do not explain these technologies from the point of view of the "working bioinformatician" i.e. what is in it for me if I bother to investigate RDF now ? Read on for some working examples of RDF data integration and a few thoughts on the future of the semantic web and the life sciences (warning, this is an article not a post, so grab a beverage or something).
In my opinion the main selling point for RDF *right now* is as a way to avoid the often heard complaint of "I spend most of my time reformatting data files and not doing science". But how does RDF achieve this you may ask ? The answer is through its generality. RDF standardizes the model we use to communicate data, whereas XML only standardizes the syntax. For RDF to work the model it uses must therefore be very general. This is both an advantage and a curse for RDF: it makes RDF very useful, as we shall see, but also makes it very hard to get a concrete grasp of what it actually is. RDF is really very simple: it is a general graph based model (nodes and edges) for communicating data. In the next two paragraphs I will explain what the RDF data model consists of and the commonly used XML format for writing it into files.
The RDF data model is simple: everything (and I mean everything) is a resource, resources have properties, and those properties have values. Alternatively you might like to think in terms of objects and key value pairs associated with objects. What makes RDF a graph is the type of values that resource properties can have. Property values can be other resources, which means that resources can link to each other and form graphs. Property values can also be literals (i.e. strings) which can't link to other resources.
Without resorting to graphical distractions all that means is (in some kind of pseudo-code):
resource1->property1 = "some string";
resource1->property2 = resource2;
resource2->property = resource3;
Resources are also required to have identifiers; RDF uses URIs as identifiers. A URI is just the general form of a URL (like the ones you type every day into your browser).
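For those who prefer something they can run, here is a minimal Python sketch of the same pseudo-code. It is only an illustration of the data model, not how any RDF library stores its data; the example.org namespace, the property names and the Literal helper are all made up for the example.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """A plain string value; unlike a resource it cannot link to anything."""
    text: str

EX = "http://example.org/"  # hypothetical namespace, for illustration only

# A graph is just a set of (resource, property, value) statements.
graph = {
    (EX + "resource1", EX + "property1", Literal("some string")),
    (EX + "resource1", EX + "property2", EX + "resource2"),
    (EX + "resource2", EX + "property", EX + "resource3"),
}

def linked_resources(graph, resource):
    """Resources that `resource` points at directly (literals are skipped)."""
    return {value for (res, _prop, value) in graph
            if res == resource and not isinstance(value, Literal)}

print(linked_resources(graph, EX + "resource1"))
```

Following linked_resources from node to node is what walking the graph means here; a literal is always a dead end.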
Lastly we have RDF/XML, the commonly used RDF file format, which is just RDF represented as XML. RDF/XML is not fun to look at; the reason is that RDF is a graph and XML is a tree, so RDF as constrained by XML looks weird, see here. For the moment don't worry too much about that (there are supposedly more reader friendly ways to express RDF).
And that's it... almost. There are numerous subtle issues with all of the statements I have made so far concerning RDF. Semantic web enthusiasts (they do exist) will happily point out that I failed to mention things like blank nodes in RDF/XML, proper use of URIs, why URLs are better than URNs, reification, contexts, named graphs etc. Ultimately the core of RDF is simple, and I don't want to muddy the waters right now.
So where does this get us ? Think for a moment about some of your bioinformatics problems. For example you might want to take some microarray data, combine that with pathway data and phenotype data, do some data mining, make lists of significant genes based on their pathway associations etc. Now if you're a PI you're probably thinking "So ? My grad students do that all the time!"; if you're a grad student you're probably cursing your advisor and damning the fact that they don't appreciate how annoying it is to use 3 different parsers, ad hoc database tables, glue and duct tape just to get the gene lists.
First, if you want a comprehensive view of your chosen microarray data, you have a choice of at least two different databases: NCBI's GEO and the EBI's ArrayExpress. No problem, there is an internationally accepted standard for microarray data, MAGE-ML, so we only need one parser to deal with the data, right ? Wrong. NCBI doesn't like MAGE-ML; this is quoted from their submission guide (my emphasis):
We can usually accept MAGE-ML formatted data. However, MAGE-ML data may be structured in a variety of ways, so we first have to review the format and content of your files to determine that they contain all information required for successful GEO submission. Processing times for MAGE-ML submission can be substantially longer than for our other deposit routes, so if your data are not already MAGE-ML-formatted we recommend that you submit using the Web or SOFT deposit mechanisms.
So much for XML standards.
Instead they use something called Simple Omnibus Format in Text (SOFT). ArrayExpress uses MAGE-ML for submissions and download; this may have something to do with the fact that they were heavily involved in the MAGE standardization process. Next you move on to pathway data; the situation is a little better, as pathway databases seem to be standardizing on the Systems Biology Markup Language (SBML). There is a very good SBML parsing library, developed by the SBML group, called (funnily enough) libsbml. So now you have three parsers; if you want to mix in gene ontology you'll need another, and if you want phenotype data you'll probably have to deal with free text descriptions in OMIM. Once you've figured all the parsers and file formats out you then have to build your database. You open a text editor and bang out a few MySQL tables, but then you realize that your microarray gene identifiers are different to your pathway identifiers: welcome to identifier mapping hell. Suddenly the simple description of the project your adviser gave you starts to look ugly.
Can RDF and the semantic web help ? The short answer is yes, it can. The long answer is that it won't solve all your problems right now, but after reading the following example I hope you'll agree with me that it might be worthwhile keeping an eye on.
So let's get real!
In a moment you will build an integrated protein sequence and pathway database. In this example you will not have to write a parser, you will not have to deal with incompatible file formats and you will not have to build an ad hoc database to store your data. You will be able to do all this with off-the-shelf data and tools, as well as having a powerful query language to apply to your database. Remember, the power of RDF, as I mentioned in the beginning, is that it is general.
To begin, we will limit ourselves to the apoptosis pathway and the proteins involved in it. The pathway data we can get from Reactome. Go to the front page, click on the apoptosis (human) pathway, and then scroll to the bottom of the page. There you will see a link for downloading the data in BioPAX format; BioPAX is an RDF compatible pathway file format. Here is a copy of the Reactome data I prepared earlier.
Next, protein sequences; for this we will use UniProt. Eric Jain at the Swiss Institute of Bioinformatics has been dumping the entire UniProt database in RDF for some time now. You can go and get the data yourself, however the *compressed* database is over 1GB, and uncompressed over 10GB. This is the reason for only doing the apoptosis pathway; feel free to knock yourself out with the entire UniProt database if you must. So you can follow along I've provided a subset of the UniProt RDF data as a separate download. For those interested in how to deal with large XML files see the libxml2 xmlTextReader interface.
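The xmlTextReader interface is a streaming reader: it walks the file element by element instead of loading it all into memory. If you would rather stay in Python, the standard library's xml.etree.ElementTree.iterparse works in a similar spirit. The sketch below is only an illustration under assumptions: the inline sample document and the ex:Protein element are made up and are not the real UniProt RDF layout.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Tiny stand-in for a multi-gigabyte file; the ex: elements are hypothetical.
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <ex:Protein rdf:about="http://example.org/P1"/>
  <ex:Protein rdf:about="http://example.org/P2"/>
  <ex:Other rdf:about="http://example.org/X"/>
</rdf:RDF>"""

def stream_subjects(source, tag):
    """Yield the rdf:about of each `tag` element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.get("{%s}about" % RDF_NS)
        elem.clear()  # discard the element's content once we are done with it

subjects = list(stream_subjects(io.BytesIO(SAMPLE), "{http://example.org/}Protein"))
print(subjects)
```

For a real file you would pass a path or an open file object instead of the io.BytesIO wrapper.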
So now we have our data, we need our tools. Go and grab the Redland RDF toolkit, compile and install it (or apt-get it if you're smart). Redland is an RDF processing library written in object oriented C and comes with language bindings to just about everything: Perl, Python, Ruby etc. We won't use the language bindings, just the command line utilities: rdfproc (parsing) and roqet (query).
Once you have Redland installed, build the integrated database: first create a new RDF data store (apoptosis-db) and add the apoptosis pathway data from Reactome:
% rdfproc -n -s hashes apoptosis-db parse apoptosis-reactome.rdf
Next add the UniProt data:
% rdfproc apoptosis-db parse apoptosis-uniprot.rdf
Basically you're done building the database. The rdfproc utility creates a DBM hashes database, which you can then query using the RDF query language: SPARQL. Querying RDF is not that painful, and in some ways is similar to relational SQL (graphs remember, not tables). Of course you do need to know the types of resources and the properties your graphs may have before you construct queries (take a look at the files, you'll get an idea).
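Why is "parse one file, then parse the other" all it takes ? Because an RDF graph is just a set of statements, so putting two graphs in one store amounts to a set union. Here is a rough Python sketch of that idea; it says nothing about how Redland implements its stores, and apart from the P25445 accession every URI, property name and value below is invented.

```python
# Two tiny stand-in graphs. Everything under example.org is hypothetical.
pathway = {
    ("http://example.org/reactome/protein1", "http://example.org/xrefId", "P25445"),
}
sequences = {
    ("urn:lsid:uniprot.org:uniprot:P25445", "http://example.org/label", "some UniProt record"),
}

# Loading both files into one store is a union of their statements.
store = pathway | sequences

for statement in sorted(store):
    print(statement)
```

Note that the union only puts the statements side by side; it does not connect resources that use different identifiers.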
Here are a few sample queries on the data. Note: the examples use roqet, but not with the hashes database we created earlier (that was done to demonstrate parsing disparate data into the same database); this time we use the text files with the RDF/XML data as sources. If you want to use the hashes database (Redland also has support for MySQL) then you need to query via the language bindings.
First we select the subject identifiers (URIs) for each UniProt record (i.e. the URI of each UniProt resource):
% roqet -s apoptosis-uniprot.rdf query-uniprot.sq
The query looks like this:
PREFIX :
PREFIX rdf:
PREFIX bpx:
SELECT ?resource WHERE { ?resource rdf:type :Protein . }
And the results:
roqet: Querying from file query-uniprot.sq
roqet: Query has a variable bindings result
result: [resource=uri
result: [resource=uri
result: [resource=uri
result: [resource=uri
...
Next find all the resources of type protein in the apoptosis pathway (the Reactome data this time) and their UniProt IDs:
% roqet -s apoptosis-reactome.rdf query-reactome.sq
The query:
PREFIX rdf:
PREFIX bpx:
SELECT ?uniprotid
WHERE {
  ?x rdf:type bpx:protein .
  ?x bpx:XREF ?y .
  ?y rdf:type bpx:unificationXref .
  ?y bpx:ID ?uniprotid .
}
Again, the results:
roqet: Querying from file query-reactome.rq
roqet: Query has a variable bindings result
result: [uniprot=string("P25445"^^
result: [uniprot=string("P48023"^^
result: [uniprot=string("Q13158"^^
result: [uniprot=string("Q14790"^^
result: [uniprot=string("Q92851"^^
...
Note the conditions after the WHERE keyword: they are in the form of triples (RESOURCE, PROPERTY, VALUE). So ?x is any resource whose type property is equal to protein. Note also that ?y, the value in the second triple, appears as the resource in the following triples. In plain English this query reads: find resources of type protein, with cross references of type unificationXref (i.e. the UniProt IDs of the proteins), and tell us their UniProt ID. If you use the Redland language bindings you of course have more power over the way you add, query and manipulate the data: for example, you can select subgraphs, add new resources into the graph, connect the data to tools etc.
The next step is to query both graphs, and it is here that we run into a problem. You'll notice that protein resources in the Reactome data are linked to a UniProt ID, whereas the UniProt records are not identified using their UniProt ID; rather, a Life Sciences Identifier (LSID) is used:
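If the way a shared variable ties the triples together still feels abstract, here is a toy pattern matcher in Python. It is a sketch of the general idea only, not how roqet evaluates SPARQL; the node names (_:p1 and friends) and the "12345" ID are made up, and the prefixed names are treated as plain strings.

```python
def is_var(term):
    """Variables are written like SPARQL variables: a leading '?'."""
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:  # already bound to something else
                return None
        elif p != t:
            return None
    return new

def query(graph, patterns):
    """Return every set of variable bindings that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                new = match(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings

# Hypothetical miniature of the pathway graph.
graph = {
    ("_:p1", "rdf:type", "bpx:protein"),
    ("_:p1", "bpx:XREF", "_:x1"),
    ("_:x1", "rdf:type", "bpx:unificationXref"),
    ("_:x1", "bpx:ID", "P25445"),
    ("_:p1", "bpx:XREF", "_:x2"),
    ("_:x2", "rdf:type", "bpx:publicationXref"),
    ("_:x2", "bpx:ID", "12345"),
}

patterns = [
    ("?x", "rdf:type", "bpx:protein"),
    ("?x", "bpx:XREF", "?y"),
    ("?y", "rdf:type", "bpx:unificationXref"),
    ("?y", "bpx:ID", "?uniprotid"),
]

ids = sorted(b["?uniprotid"] for b in query(graph, patterns))
print(ids)
```

Because ?y must take the same value in every triple it appears in, the publication cross reference is dropped and only the unification cross reference contributes an ID.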
urn:lsid:uniprot.org:uniprot:P25445
Which means we cannot link resources across the two graphs, and so cannot query both graphs together. So we are back to identifier mapping hell once again.
So much for RDF.
Maybe. I am not intending to present RDF to you as a panacea for biological data integration. However, I believe RDF gets us one step closer, at a practical level, to realizing real data integration. When it comes to RDF there are plenty of grumbles. At the moment some stuff works, some stuff doesn't. There is some promise in RDF as an approach (you didn't see me write a parser did you ?) but it may not be the final word. What is ? Who knows ?
Fatal flaw in my cunning plan ?
So did you spot the fatal flaw in this grand vision of semantic data integration ? Right you are, all the data must be in RDF. And yet the vast majority of biological data is not in RDF format, so why bother ? There are a few solutions: XML data is relatively easy to convert, and more and more data will become available in RDF, especially when you start to produce it :)
Second issue: you have only combined the data in the same triple store (RDF database), you did not really *integrate* the data. For example, UniProt resource PXXXX is the same as Reactome reference YYYY; how can we make these one and the same (the integrated query problem) ? Good question. This will involve re-organizing the graph, which is a little bit more detail than I wanted to go into. Google for RDF smushing if you're interested.
What if I combine tons of data of all different kinds, what then ? It would be too difficult to merge by hand crafted queries etc. It is still early days and there is no real solution to this problem. My bet is automated smushing based on graph isomorphisms, ontology mapping etc.
What about graph theory, what about inference and rule languages, what about ONTOLOGIES, what happened to the SEMANTIC bit ? Go and look at the Jena RDF tool kit for their ontology and inference support; the documentation is great. Note: Redland doesn't support ontologies or inference, it is intended to be more low level. Obsession with ontologies, I believe, has overshadowed the more practical foundations of RDF for the life sciences (no parsers). I might write more on ontologies in the future.
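To make the idea a little more concrete, here is one naive way to picture it in Python: build a table saying which node is really which, then rewrite every statement through it. This is my own simplified sketch, not the algorithm of any smushing tool; the ex: names and the label value are invented, and only the LSID pattern and the P25445 accession come from the data discussed above.

```python
LSID_PREFIX = "urn:lsid:uniprot.org:uniprot:"

def lsid_for(accession):
    """Build the LSID style identifier from a bare UniProt accession."""
    return LSID_PREFIX + accession

def smush(graph, same_as):
    """Rewrite statements so that aliased nodes collapse onto one identifier."""
    def canon(term):
        return same_as.get(term, term)
    return {(canon(s), p, canon(o)) for (s, p, o) in graph}

# Hypothetical combined store: one pathway node, one sequence record.
graph = {
    ("ex:reactomeProtein1", "ex:inPathway", "ex:apoptosis"),
    ("ex:reactomeProtein1", "ex:uniprotId", "P25445"),
    (lsid_for("P25445"), "ex:label", "some UniProt record"),
}

# Any node carrying a UniProt ID is declared to be the matching LSID node.
same_as = {s: lsid_for(o) for (s, p, o) in graph if p == "ex:uniprotId"}

merged = smush(graph, same_as)
for statement in sorted(merged):
    print(statement)
```

After the rewrite the pathway membership and the sequence record hang off the same node, so a single query can reach both. Real data is messier: one protein can carry several IDs and one ID can appear on several nodes, which this sketch ignores.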
How do I know if I'm using the right URI for my resource ? What if two people use the same URI for two different resources, what if people use two different URIs for the same resource (i.e. the current identifier problem), won't all this semantic web stuff just fall over and die ? The answer to this is: we learn to live with confusion. Look around you, how many people do you see communicating clearly, effectively and unambiguously about the world around them, the events and interactions that take place in their daily lives ? Answer, very few. And why would you think it would be any different in the semantic web ? After all humans are building it.
Hold on, what about the subjects, predicates and objects that I read about in the standard RDF documentation ? RESOURCE, PROPERTY, VALUE is the same as subject, predicate and object. SPO terminology is a hangover from AI.
To be honest I am not a semantic web enthusiast anymore. I think the possibilities are interesting, but many of the reasoning and inference tools for RDF have their roots in the AI community, and the whole "machines will think using first order inductive logic" idea was a total failure. It seems that a lot of AI people are infesting the semantic web, in a similar fashion to statisticians becoming bioinformaticians. The future of the semantic web will be in understanding how humans "live with confusion" and applying that understanding to the semantic web.
Enough. Please leave a comment if you made it this far or found any mistakes. Cheers.
Resources:
- http://www-128.ibm.com/developerworks/library/j-sparql/ - SPARQL with Java stuff
- http://jena.sf.net/ - Java RDF toolkit Jena (developed by HP)
- http://librdf.org/query/ - Redland online query interface, play with SPARQL.
- http://labs.intellidimension.com/uniprot/default.rsp - SPARQL + UniProt
- http://labs.intellidimension.com/uniprot/query2.rsp - Play with SPARQL + UniProt
- http://www.biostat.harvard.edu/~carey/hbsfin.html - If you *must* do all this in R


Comments
Knowledge discovery with the Semantic Web
It's possible to query RDF, but is it possible to discover new knowledge using RDF data and ontologies with some machine learning strategies ? Do you know any examples ? Can Jena or R/Rswub do that task ? Recently I was interested in the Weka framework, but it seems that there is nothing about RDF data mining. I have seen that RDF and OWL can be extended with probability information. The semantic network can be extended as a Bayesian one. My problem is that representing knowledge and information with RDF is not enough. What I want is a framework to display, query and derive new information. Something like that. Any idea welcome ;)
I don't know of any system based on the semantic web technology stack (RDF/OWL/SPARQL/RULEML) that can do what you ask. However in my travels I have come across the following papers that attempt to discover new knowledge based on similar concepts (graphs, inferencing, ontologies etc.), these might provide a few pointers:
When you originally posted this I couldn't remember these papers (shame on me for not keeping my citation database up to date). It was only during a recent paper "clean out" that I came across them again :)
A nice introduction to RDF
A nice introduction to RDF was just posted on the semantic web interest group mailing list. It is a bit long, and not very gentle, but appropriate if you know your way around Perl/Python/Ruby/XML etc.
Bunch of links here too.
Your favorite Semantic Web book
Nice introduction Greg. I'm exploring this direction for my own research. Do you have a book to suggest? I have seen this one that looks interesting: Explorer's Guide to the Semantic Web
This article was mostly off the top of my head after reading the nature article (with a little bit of time messing with the RDF data). So I'm happy that it came out as a nice introduction, or at least an understandable introduction.
I have at no point in my investigation of the semantic web read any definitive books on the subject. The Explorer's Guide to the Semantic Web "looks good", although I can hardly recommend it, having not read it. The concept is so amorphous that it is hard to get a real hold of; I think I got there the hard way, by just doing a lot of reading. I will try and put together a few resources for the semantic web on the wiki. For the moment you can try the official home page for The Semantic Web for Life Sciences W3C working group.
Or, since I have a lot of spare time at the moment, I might even start writing a book on the subject...
I'm beginning to get it
This is good stuff, I enjoyed it.
I used to think that the time I spend pre-processing and reformatting data before I can even use it for research was just part and parcel of bioinformatics. Now though, it just annoys me. I'm currently interested in retrieving intergenic regions from genomes (a fairly simple exercise in feature coordinate finding really) and I wish there were a web resource where I could simply type "NC_001234:1000..2000" or whatever and get that region in a fasta file. I posted a small piece over at one of my blogs on the challenges of parsing GenBank for this purpose.
Off-topic re: Nature - I hate their website for many reasons - bad design, flash ads, those useless careers editorials at the end - but mainly because of their nonsensical policy on what's premium content and what isn't. And they're not alone in that. The Science site is little better. You'd think the "big two" could get it together.
Nature v Science
I have to admit that I'd take Nature over Science in this respect: at least you can get to the content relatively easily. Science is an absolute nightmare. All that's missing is a row of blinking dots and an "under construction" sign...
The thing that floors me is that NPG are charging separate subscription rates for access to parts of their archives covering different periods. How extortionate is that? I'm sure it's been expensive to convert old back issues to pdf (which didn't exist in the 1880's), but really...