A pipeline is a makefile

What is a pipeline? For me, it's a series of steps that munch DNA/protein data, combine it with other data using various small scripts and output the results as diagrams or HTML. Do we want to code this kind of software as a script? If you think "makefile!" now, then you're much more clever than I was. Personally, until recently, I glued my scripts together using other scripts and used makefiles only for compiling my programs. That was a bad idea.

My analysis tasks consist of roughly 60% parsers/converters, a bit of hashing to analyse things and convert them into the right format for R, and one final line to instruct R to plot the whole mess. This might be different for you, but I bet you have several small programs that convert data with a, analyse it with b, analyse it further with c and output it with d.

Now let's say that you've finished part a. While developing b, you don't want to run a all the time, so you comment that line out. Later you comment out b. Then c. After having finished d, you're debugging b again. Now you have to make sure c and d are run to check the final result, so you remove their comments again. Then your boss drops in and wants a parameter in c changed. You comment out a and b to make it run faster, change c and run it again.

That's OK for a pipeline with four steps. But my programs tend to grow and grow until I have 20 steps with different branches. It's a huge mess, and after a while I've completely forgotten which part depends on which others. Then, during a visit to the Vingron group, I saw people typing "make" all the time. It's so simple, I don't know why I never thought of it.

I don't uncomment anymore. All my analyses start with a makefile in an empty directory. All data files have a sensible file extension, say ".input.gff", "genes.gff", etc., and reside in their own data directory. Conversions are makefile rules that convert one file (extension) into another. Analysis steps have these files as preconditions. Whenever a data file changes, all steps that depend on it are redone. Whenever a script changes, all data files that it produces are redone. That is all very obvious to anyone with a little experience with makefiles; it simply didn't occur to me to use the whole machinery for my pipelines.

In case you don't know what makefiles are (I admit that most nodalpoint readers can completely skip this paragraph): on Unix, Mac OS X or Windows/Cygwin, create a file called "makefile" anywhere and put two lines into it. The first line reads "test2.txt : test.txt"; this tells make that the following commands need a file test.txt and will create a file test2.txt. The second line starts with a tab (make sure that it really is a tab character), followed by "cat test.txt | tr 'H' 'B' > test2.txt", which will replace every H with a B. Save this. Create a second text file test.txt with the contents "Hallo world!". Now, if you run "make test2.txt", make will execute the commands and they will create test2.txt. If you run "make test2.txt" again, nothing will happen: make has seen that test2.txt is newer than test.txt and won't repeat the commands. Change test.txt, save it and run "make test2.txt" again. Make will recreate test2.txt.
This is exactly what you want your pipeline to do: do not repeat steps if the files that they depend on are already there, repeat steps if something has changed. Makefiles have tons of options, even quite complete string substitution targeted at filenames. "info make" should give you reasonably complete documentation.
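
Written out, that is a two-line makefile (the second line has to start with a tab character):

test2.txt : test.txt
	cat test.txt | tr 'H' 'B' > test2.txt

And the test run on the command line:

echo "Hallo world!" > test.txt
make test2.txt     # runs the command and creates test2.txt
make test2.txt     # does nothing, test2.txt is already up to date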

There are a couple of details that I've learnt now:

  • All steps should depend on their respective scripts. If I change a script, all steps downstream of it are redone.
  • All data files go into their own directory, all result files as well
  • The VPATH variable lets you specify additional directories in which preconditions are searched. I set it to my data directories where the genomes, gene models etc. are.
  • I knew "%" before but tend to forget it: a rule like "%.bed: %.gff" makes a .bed file from the .gff file with the same stem. The automatic variables are then $< for the .gff file (the precondition) and $@ for the .bed file (the target), so the command line reads: "gffToBed $< $@" (see the sketch after this list).
  • Sanity checks of parameters: "ifndef BLA", "$(error BLA is not defined!)", "endif" on three separate lines will check whether the variable BLA is defined.
  • Of course, a makefile can call itself. Parameter parsing is already built in (just put P1=BLA on the command line), so to keep track of the parameters you used for various analyses, add a target "MyRun" with no preconditions that calls "make analysis P1=BLA P2=BLA".
  • Make runs independent targets in parallel if you give it the -j option. Sometimes this is not intended, e.g. if you're downloading data and don't want to open 100 connections at the same time. Adding a ".NOTPARALLEL:" target tells make to run serially even when -j is given.
  • Results are cached if you put your application on the web. If your filenames include the parameters, then even if someone needs a file that takes a long time to produce, it might already be around and will be re-used automatically. Let's say user A searches the genome for TATAA and the makefile saves the result in a central directory as TATAA.gff. For everyone who runs the pipeline with the same pattern afterwards (make pattern=TATAA), the makefile will make sure that the old results are re-used directly. OK, I admit this is a rare case... :-)
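
Putting a few of these points together (script dependencies, VPATH, a pattern rule and an ifndef check), a stripped-down makefile could look roughly like the sketch below. The script names gffToBed and plotCoverage.R, the GENOME parameter and the directory names are just placeholders, not files from a real analysis:

# search preconditions in the data and results directories as well
VPATH = data:results

# sanity check: refuse to run if no genome was given on the command line
ifndef GENOME
$(error GENOME is not defined! Call e.g. "make GENOME=hg18")
endif

all: coverage.pdf

# pattern rule: build a .bed file from the .gff file with the same stem;
# the step also depends on its script, so editing gffToBed redoes it
%.bed: %.gff gffToBed
	./gffToBed $< $@

# the plotting step depends on its script as well
coverage.pdf: $(GENOME).genes.bed plotCoverage.R
	Rscript plotCoverage.R $< $@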

Update 2013: Six years after this post, I've stopped using makefiles. The concept of dependencies is right, but the syntax was too awkward and they don't scale well for big projects. Making them work requires setting weird special variables, so no one but you will understand the makefile. Habits have evolved since 1977 and few people are makefile experts anymore (unlike Mark Diekhans :-) ). I now write simple Perl or Python scripts that take a single parameter, the step at which to start processing. This is not as automatic, but it is easier to read for others and for myself. I'm still looking for makefile alternatives in bioinformatics; if you think you have found a good one, drop me an email at maximilian at gmail com, as the comments are closed. Big thanks to Neil Saunders for keeping this blog working for so long!


Comments


model multiple outputs in make

Hi, I made a serious effort to use make in our pipeline, but there is one limitation that seems really hard to work around: make doesn't understand programs that generate multiple outputs. Let's say an alignment program writes a psl to stdout and a log to stderr. I can't write

f.psl: f.fasta
align f.fasta > f.psl

f.log: f.fasta
align f.fasta 2> f.log

unless I want to run the computation twice, which is of course unacceptable unless one has unlimited resources available. Since I was using make to, among other things, control parallel execution on the cluster, this redundant computation defeats the purpose of using make. Skam doesn't seem to address the problem, but I only took a quick look at the documentation. Any comments?


model multiple outputs in make

hm... I might have missed something, but why don't you write:
f.psl f.log : f.fasta
align f.fasta > f.psl 2> f.log


Doesn't change a thing

What you wrote is unfortunately equivalent to what I wrote: it's just shorthand notation for two separate rules, which I should have explained right away. It is going to execute the command twice. What we need is a way to express that a command has multiple outputs, and rules with multiple targets don't accomplish that.


A hack (not too kludgy)

Designate one of the multiple outputs as the representative file (say foo.psl), touch the other files at the end of the commands, and add the rule: foo.log foo.whatever ... : foo.psl
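
Roughly, with the hypothetical "align" program from the question above, that would be:

f.psl: f.fasta
	align f.fasta > f.psl 2> f.log
	touch f.log      # make sure f.log ends up newer than the representative f.psl

# the second output only depends on the representative file, no command needed
f.log: f.psl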


what about taverna, scufl..?

What do you think about projects like taverna (http://taverna.sf.net) and its scufl language?
I've been trying to use it to organize an analysis of mine on some genes in yeast.
Unfortunately I'm not so good with Java and with work organization in general... but I will try to learn to use it in the future.
I also would like to learn UML, since I'm starting with OO programming.

I know that there are other projects like taverna to organize bioinformatics workflows, but I've never used them.


taverna, biomake

Exactly. You like the idea, it's just that you've never used them. Makefiles are 30 years old, there's TONS of documentation for them, everyone can read them. They are installed everywhere. They are awkward to write at first, but fast and proven. Almost no bugs.
I know neither scufl nor biomake. But I bet they have little documentation, almost no user base and quite a few bugs.
Sorry, taverna and biomake folks. :-) But the whole pipeline stuff is _support_ for me, not my main occupation. I already have enough problems; no need to fumble around with yet another piece of "ready for publication" software.


taverna has an active user base

Taverna isn't trying to replace make, it does some things that are a bit like make, but other things that are quite different. As for taverna users, there is an active and growing community judging by the mailing list archive on sourceforge over the last four years. I think a common theme in bioinformatics software (Taverna, BioMake and many others) is trying to develop tools that provide the right level of abstraction, so people can get on with the Science, rather than spending all their time battling with and reinventing various software wheels.


taverna

Yes, which is why I'm waiting until I see people applying taverna in papers. I'm sceptical when it comes to "ubertools" (similar: generic parsers, generic browsers, generic languages for dynamic programming, etc...) for our very special ("throw away") science applications, but I'm open to being convinced of the contrary. For the time being I prefer a wheel that has been out there rolling for a while.


I agree with your sentiment

I agree with your sentiment regarding generalized 'ubertools'. So much effort is expended creating a generalized workflow editor that the editor becomes an end in itself.

Another thing that gets me about graphical workflow editors is that for anything other than the most common tasks, you need to revert to scripting to plug outputs into inputs. So why not scripting in the first place? Maybe I'm not the right audience for Taverna...


Why not scripting?

Scripting is great, a powerful tool that lets you achieve world peace in three lines of Perl/Python/Ruby. But what if people don't want to hack scripts? According to Grady Booch, the history of software engineering is one of increasing levels of abstraction, which is where Taverna and workflows are trying to go. Admittedly, we're not quite there yet: sometimes the abstractions leak and, as Stew says, you end up hacking BeanShell. That's not really a problem with Taverna or workflows, it's an inherent problem with bioinformatics data, a flat-file legacy nightmare that means we'll be forced to hack scripting languages for a long time to come, whether we like it or not.


taverna papers

Not sure what you mean by "applying taverna for papers", but here is one example of what I think you might be interested in, the successful application of taverna workflows to a difficult problem.


you can switch taverna <-> makefile

Yeah, but in principle it should be possible to switch from makefile <-> taverna scufl file by writing a few scripts, or at least I think so.

One of the things I like about taverna is the diagrams you can produce, which are useful for describing how your project works.
Maybe you can obtain diagrams from makefiles, too.
Moreover, having a user base (I am in the taverna one) lets you share scripts, as in Bio{Perl, Python, ...}, and using a more science-specific tool has its advantages.


diagrams

Erm... I don't see why anyone would need diagrams of a pipeline. I can draw them in PowerPoint one day, when I really need them...


diagrams and pipelines.

Too bad you don't use diagrams to describe your work.
It's very useful to show other people, and yourself, how your pipeline works: it helps you write the right unit tests for your programs and see which improvements/changes you can make to them.
Also, it's easier to compare two experiments when they are described with diagrams produced with the same syntax.
I wonder why wet-lab biologists don't use diagrams to describe their experiments when writing papers, too (bioinformaticians would be very happy then).
If you drew a diagram of the pipeline you've described in your post, I would understand it easily.


powerpoint vs graphviz diagrams

You could draw them in PowerPoint, but if they start getting complicated, it is time consuming. Having a standard way of drawing pipelines/workflows is a handy tool for quickly communicating what your analysis does graphically. Taverna uses GraphViz to do this, see for example this workflow diagram. I dunno about you, but I wouldn't fancy drawing that figure manually in PowerPoint, when GraphViz can do most of the hard work for me.


diagrams

The workflow diagram that you provide is, for me, not a real diagram. It's too complicated; I don't see a lot of value in it.

If a diagram is getting complicated, it's taking up too much space in your paper, isn't it? I'd rather write the _details_ as text.

If a diagram is simple, I can also draw it myself and optimize the layout in a more human-friendly way than graphviz can.

We know the problem from software engineering, right? I don't believe in the added value of UML-generating tools. You end up with thousands of diagrams that no one can read anymore. If diagrams are created automatically, they lose their value as a means to simplify, even over-simplify, the system.

Biologists use textual protocols for a reason: they are much more compact, and as everything is linear for them, there's no real need for the expressive power of flowcharts (complex branches or repetitions). I don't have these in my pipelines either, as they are rather linear in structure.


Taverna

I don't get it. The diagram is too complex? Of course it is complex. But everything is pointing towards a workflow (SOA) world. Why does bioinformatics deny this turn? Oracle, SAP, IBM and so on are all rewriting their applications so they can use BPEL as a graphical build environment for integration processes. So stick to the 90s and keep using scripts, but within a few years we will all use workflows in a SOA world. I really believe this will happen; IBM believes in it, and so does the whole IT world. Let's see who's right.


Let's agree to disagree

Many taverna users I've been in contact with find the visualisations very useful. Clearly you're not one of these people, so let's agree to disagree.


BioMake

Yeah this is definitely something that makes sense.

Ian Holmes has proposed aspects of this in one of their Google Summer of Code projects.

Chris Mungall also has talked about a BioMake project that grew out of some of his tools that were developed initially for BDGP pipelines. He talked about BioMake at BOSC2004 and here is his presentation.

I think it would be great to have this more generally available, especially since many queuing software tools allow for dependency-driven/make-aware submission of jobs.


If you do enough research on

If you do enough research on using make for bioinformatics pipelines, yes, you eventually come across Chris Mungall's BioMake. I have read through the presentations and found the rationale behind it quite lucid; however, I never had the tenacity to figure out the installation and setup of BioMake. At the moment I'm still stringing things together with Python.

If anyone is interested in the philosophy of bioinformatics in the large, I highly recommend reading Evolving from bioinformatics in-the-small to bioinformatics in-the-large. (pdf copy online here). This paper also describes the use of make for bioinformatics pipelines.


made guys

Thanks for that Chris Lee paper... I've been using 'make' for pipelines since 1996 (I actually built my entire thesis using makefiles, from analysis to latex to postscript). However it has serious limitations, as documented by Andrew Uzilov on our wiki.

As Jason pointed out, we are offering a Summer of Code proposal (applications this week!) to boost 'make' past these limitations. This largely grew out of discussions with Chris Mungall on his biomake project (which also has a biowiki page detailing some of the high-level design goals).

I think Chris has done as good a job as anyone of defining the high-level goals for such a tool: declarative structure, shell script hooks, flexible dependency tracking (MD5 etc rather than just timestamps), facility to build database tables rather than just files, advanced pattern-matching (not just one wildcard per rule, as in 'make'), parallel execution on a cluster and (ideally) a Turing-complete functional programming syntax so that you can start to do low-intensity computation within the pipeline language itself.

Sadly, with Chris now doing more ontology work than genome annotation, biomake has stalled somewhat. There are some practical alternatives, e.g. Perl modules such as Shengqiang Shu's SAPS modules (used by the Berkeley Drosophila Genome Project for their pipeline); and then there are some pie-in-the-sky (but theoretically appealing) functional language-based alternatives, like Erlang (based on Prolog) or Termite (a dialect of Scheme/Lisp).

In the meantime, there are several versions of make that can use GNU make's remote stubs feature to do parallel execution on a cluster. These include distmake, qmake and omake (which also has MD5-based dependency tracking). And of course there are things like Apache Ant, but then you're moving too far away from the command line for my liking, personally ;-)

As you might gather from that last sentence I'm not exactly a proponent of the Taverna-style approach. I like what they're doing but I completely agree with the previous commenter that graphical editors tend to become an end in themselves. I think that there is quite enough to do with developing a workable domain-specific declarative programming language for pipelines without trying to build Yahoo Pipes at the same time. But that's just me, I'm a born-in-the-20th-century fogey; what can I say.

If anyone reading this is in the Bay Area at noon on Wednesday April 4th 2007, btw, we're having a lab meeting to discuss exactly this issue (make and successors). Come to my lab, 425 Hearst Mining Building on the Berkeley campus, and meet some other "made guys".

http://biowiki.org/IanHolmes


make alternatives: makepp?

Great, these wiki pages. My web-searching reflexes still aren't good enough; I hadn't searched for "bioinformatics pipelines make" before submitting the post.

Have any of you Perl coders tried makepp? Though it doesn't address many of the issues raised, it might be 1) a step in the right direction while maintaining compatibility (a cherished concept for us) and 2) a base to build on one day, as I guess you would rather modify Perl code than C code.
I just hope you won't head off in the LISP/Scheme/Prolog direction. While it might be tempting to write a completely new make system, something that keeps at least some superficial compatibility can lure many more people into trying it out than a new system that we would have to learn from scratch.

Would you mind posting some results from your discussion to the frontpage of biowiki or here, for those people that don't live in the bay area?


make alternatives

There's also a page on freshmeat with a collection of make alternatives. For Python fans: Scons.org


makeovers

We certainly would want the replacement to look almost (if not exactly) like make. It should definitely be as simple as make -- that's probably the primary design goal: that you can cut and paste from the command line to a pipeline description file, with minimal extra typing. I'm fully aware of the dangers of excessive re-engineering...

Thanks for the other links... we'll certainly look into them, and post discussion summaries on biowiki.

http://biowiki.org/IanHolmes


