What is a pipeline? For me, it's a series of steps that munch DNA/protein data, combine it with other data using various small scripts and output the results as diagrams or HTML. Do we want to code this kind of software as a script? If you think "makefile!" now, then you're much more clever than I was. Personally, until recently, I glued my scripts together using other scripts and used makefiles only for compiling my programs. That was a bad idea.
My analysis tasks consist of roughly 60% parsers/converters, a bit of hashing to analyse things and bring them into the right format for R, and one final line to instruct R to plot the whole mess. This might be different for you, but I bet you have several small programs that convert data with a, analyse it with b, analyse it further with c and output it with d.
Now let's say that you've finished part a. While developing b, you don't want to run a all the time, so you comment that line out. Later you comment out b. Then c. After having finished d, you're debugging b again. Now you have to make sure c and d are run to check the final result, so you remove their comments again. Then your boss drops in and wants a parameter in c changed. You comment out a and b to make it run faster, change c and run it again.
That's OK for a pipeline with four steps. But my programs tend to grow and grow until I have 20 steps with different branches. It becomes a huge mess, and after a while I've completely forgotten which part depends on which others. Then, during a visit to the Vingron group, I saw people typing "make" all the time. It's so simple, I don't know why I never thought of it.
I don't uncomment anymore. All my analyses start with a makefile in an empty directory. All data files have a sensible file extension, say ".input.gff", "genes.gff", etc., and reside in their own data directory. Conversions are makefile rules that turn one file (extension) into another one. Analysis steps have these files as prerequisites. Whenever a data file changes, all steps that depend on it are redone. Whenever a script changes, all data files that it produces are redone (see the sketch below). That is all very obvious to anyone with a little experience with makefiles; it simply never occurred to me to use the whole machinery for my pipelines.
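To make this concrete, here is a minimal sketch of such a makefile. The file layout and the script "plot.R" are invented for illustration; "gffToBed" is the converter mentioned further below:

    # conversion rule: create a .bed file from a .gff data file
    data/genes.bed: data/genes.gff
	gffToBed data/genes.gff data/genes.bed

    # analysis step: redone whenever the data file or the script changes
    out/genes.png: data/genes.bed plot.R
	Rscript plot.R data/genes.bed out/genes.png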
In case you don't know what makefiles are (I admit that most nodalpoint readers can completely skip this paragraph): on Unix, MacOS or Windows/Cygwin, create a file called "makefile" anywhere and put two lines into it. The first line reads "test2.txt : test.txt"; this tells make that the following commands need a file test.txt and will create a file test2.txt. The second line is a tab (make sure that it really is a tab character) followed by "cat test.txt | tr 'H' 'B' > test2.txt", which will replace every H by a B. Save this. Create a second text file test.txt with the contents "Hallo world!". Now, if you run "make test2.txt", make will execute the commands and create test2.txt. If you run "make test2.txt" again, nothing will happen: make has seen that test2.txt is newer than test.txt and won't repeat the commands. Change test.txt now, save it, and run "make test2.txt" again. Make will recreate test2.txt.
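Put together, the whole makefile of this example is just (remember that the second line starts with a tab):

    test2.txt : test.txt
	cat test.txt | tr 'H' 'B' > test2.txt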
This is exactly what you want your pipeline to do: do not repeat steps whose results are already there and up to date, and repeat steps whenever something they depend on has changed. Makefiles have tons of options, even fairly complete string substitution targeted towards file names. "info make" should give you reasonably complete documentation.
There are a couple of details that I've learnt along the way, each illustrated with a small sketch after the list:
- All steps should depend on their respective scripts: if I change a script, all steps downstream of it are redone (first sketch below).
- All data files go into their own directory, and so do all result files.
- The VPATH variable lets you specify additional directories in which prerequisites are searched. I set it to the data directories where the genomes, gene models etc. live (sketch below).
- I knew "%" before but tend to forget it: a pattern rule like "%.bed: %.gff" takes gff files and produces bed files. Inside the rule, $< stands for the .gff prerequisite and $@ for the .bed target, so the command line reads: "gffToBed $< $@" (sketch below).
- Sanity checks of parameters: "ifndef BLA", "$(error BLA is not defined!)", "endif", each on its own line, will abort make with an error if the variable BLA is not defined (sketch below).
- Of course, a makefile can call itself. Parameter parsing is already built in (just mention P1=BLA on the command line), so to keep track of the parameters you used for various analyses, add a target "MyRun" with no prerequisites that calls "make analysis P1=BLA P2=BLA" (sketch below).
- When run with the "-j" option, make executes independent steps in parallel. Sometimes this is not intended, e.g. if you're downloading data and don't want to open 100 connections at the same time. ".NOTPARALLEL: download" makes the prerequisites of the target "download" run one at a time (sketch below).
- Results are cached, which pays off if you put your pipeline on the web. If your file names include the parameters, then even a file that takes a long time to produce might already be around and will be reused automatically. Let's say user A searches the genome for TATAA and the makefile saves the result in a central directory as TATAA.gff. For everyone afterwards who runs the pipeline with the same pattern (make pattern=TATAA), the makefile will make sure that the old result is reused directly (sketch below). OK, I admit this is a rare case... :-)
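For the script dependencies from the first point, a minimal sketch (the script name "filterGenes" is invented):

    # genes.filtered.gff is rebuilt when the data *or* the script changes,
    # and everything downstream of it follows
    genes.filtered.gff: genes.gff filterGenes
	./filterGenes genes.gff > genes.filtered.gff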
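Setting VPATH looks like this; the directory names are made up:

    # search prerequisites in these directories as well
    VPATH = /data/genomes:/data/geneModels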
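The pattern rule, written out in full:

    # build any .bed file from the .gff file of the same name;
    # $< is the prerequisite (.gff), $@ is the target (.bed)
    %.bed: %.gff
	gffToBed $< $@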
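The parameter sanity check, with the newlines in place:

    ifndef BLA
    $(error BLA is not defined!)
    endif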
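The bookkeeping target for parameters; "analysis" stands for whatever your real top-level target is:

    # records which parameters this analysis was run with
    MyRun:
	$(MAKE) analysis P1=BLA P2=BLA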
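A sketch for the download case; the URL and file names are invented (note that older GNU make versions ignore the prerequisite list on .NOTPARALLEL and simply disable parallelism for the whole run):

    .NOTPARALLEL: download
    # under "make -j", these would otherwise be fetched in parallel
    download: chr1.fa chr2.fa chr3.fa
    chr%.fa:
	wget http://example.org/genome/$@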
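And a sketch of the cached results; "searchGenome" and the paths are invented:

    # run as: make search pattern=TATAA
    # the parameter is part of the file name, so a second run with the
    # same pattern finds results/TATAA.gff and does nothing
    search: results/$(pattern).gff
    results/$(pattern).gff: genome.fa
	searchGenome $(pattern) genome.fa > $@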