Many tools for building pipelines are available now. So why MXP?
The answer is that MXP has a number of distinguishing features and is better suited to its application domain.
Here we discuss these features.
Directories as targets
This is probably the most distinguishing feature of MXP. All other frameworks that we are aware of are file-based: each pipeline step takes a number of files as its input and produces other files.
It is relatively easy to describe pipeline steps that take a single file as input and produce a single file. However, this is rarely the case in bioinformatics applications. Usually, a set of files is the input, and a set of files is produced as output (for example, PLINK takes a triplet of files, .bed, .bim, and .fam, as input, while the set of files produced as output depends heavily on the operation performed). Moreover, the set of input files may depend on subtle details of the operation (for example, creating a subset of data with PLINK may or may not require files containing a list of samples and/or a list of markers). Having one or more directories as the input and output of a step allows us to easily manage arbitrary sets of files.
With input and output files grouped together in directories, it is easy to find out which files were used and which files were produced during a pipeline step. With file-oriented tools it is possible, of course, to place files in subdirectories, but this has to be done manually, and it still does not eliminate the need to specify all input and output files of a step.
Finally, using directories as targets gives a clear answer to the question of where to store logs and other supplementary files.
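The PLINK subsetting case mentioned above can be sketched in Bash as follows. This is only an illustration of a step that is described by its input and output directories, not MXP's actual interface: the function name plink_subset_cmd, the file prefix data, and the file names samples.txt and markers.txt are assumptions made for this example. The function only prints the command line it would run (one argument per line) and does not invoke PLINK.

```bash
# Hypothetical sketch (not part of MXP): build a PLINK command line from
# the contents of an input directory. All file names are assumptions.

# plink_subset_cmd IN_DIR OUT_DIR
# Prints the PLINK arguments, one per line; nothing is executed.
plink_subset_cmd() {
    local in_dir="$1" out_dir="$2"

    # The .bed/.bim/.fam triplet is addressed by its common prefix.
    set -- plink --bfile "$in_dir/data" --make-bed --out "$out_dir/data"

    # Optional inputs: used only if they are present in the directory.
    if [ -f "$in_dir/samples.txt" ]; then
        set -- "$@" --keep "$in_dir/samples.txt"
    fi
    if [ -f "$in_dir/markers.txt" ]; then
        set -- "$@" --extract "$in_dir/markers.txt"
    fi

    printf '%s\n' "$@"
}
```

The caller names only two directories; whether a sample list or a marker list takes part in the step is decided by what the input directory contains, so the step description does not change when the set of files does.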
Script-based control over what should be re-executed
When MXP decides whether a step of a pipeline should be re-executed, it checks (1) whether the input data were changed, and (2) whether the scripts used to obtain the result were changed. The second check is a distinguishing MXP feature.
In the case of software development (the traditional domain of GNU Make), the programmer modifies source code and then needs to recompile what was changed. Thus, changes in input files determine what should be done to obtain the final result.
In contrast, in bioinformatics applications files are almost never directly changed by a user. The initial files are usually fixed from the point of view of the pipeline, as they are the results of a hardware process such as next-generation sequencing, and changes in subsequent files result from changing the parameters of the software used to obtain them. The user changes such parameters and hopes that the pipeline will catch these changes and redo the affected work.
MXP achieves this by storing a copy of all scripts used to obtain a target in the target directory.
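The idea can be illustrated with a minimal Bash sketch. It is not MXP's implementation: the function names needs_rerun and run_step and the scripts subdirectory inside the target directory are assumptions made for this example, and only check (2), the script comparison, is shown.

```bash
# Hypothetical sketch (not MXP code): re-run a step when its script
# differs from the copy saved in the target directory.

# needs_rerun SCRIPT TARGET_DIR
# Returns 0 if the step should be re-executed, 1 if it is up to date.
needs_rerun() {
    local script="$1" target_dir="$2"
    local saved
    saved="$target_dir/scripts/$(basename "$script")"   # assumed layout

    # No saved copy: the target has never been built.
    [ -f "$saved" ] || return 0

    # cmp -s is silent and exits with 0 only if the files are identical.
    if cmp -s "$script" "$saved"; then
        return 1
    fi
    return 0
}

# run_step SCRIPT TARGET_DIR
# Runs SCRIPT (passing TARGET_DIR as its argument) if needed, then
# stores a copy of the script next to the results.
run_step() {
    local script="$1" target_dir="$2"
    if needs_rerun "$script" "$target_dir"; then
        mkdir -p "$target_dir/scripts"
        bash "$script" "$target_dir" || return 1
        cp "$script" "$target_dir/scripts/"
        echo "rebuilt"
    else
        echo "up to date"
    fi
}
```

In this sketch the saved copy is updated only after the script succeeds, so a failed run is retried the next time. Editing a parameter inside the script changes its contents, which is enough to trigger a rebuild even though no input file was touched.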
Replacing pieces of standard pipelines
MXP allows the use of standardized pipelines with minor modifications, for example with changed parameters for quality control. The standard pipeline code may reside in a location that is read-only for the user; nevertheless, modifications are possible.
Bash as a language for pipeline description
MXP is based on Bash. This decision was influenced by several considerations.
First, it is not a good idea to create a proprietary language. It will never be as developed as commonly used languages, and sooner or later this will force the end user to reject the tool. (Some pieces of a proprietary language will, in one form or another, be involved in any pipeline tool, because the user needs a way to define domain-specific notions. In MXP, this is the Makefile language, but we tried our best to allow it to cooperate freely with Bash.)
Second, there are a number of scripting languages available that surpass Bash in clarity of syntax and available data structures (Python is an example). The cryptic nature of Bash is definitely a downside.
But most of the scripting in pipeline development consists of specifying how external applications should be invoked, and this is ultimately done in Bash. Note that other pipeline development tools that are based on different languages need to introduce a construct like “shellcommand”, which allows pieces of Bash code to be inserted.
This last point was the most important reason why we chose Bash.
Other features
Among the other features of MXP, which are also implemented in other tools for building pipelines, we would like to mention the following.
Convenient logging: MXP always creates a full log of a run and stores it for future investigation. There is no need for the user to specify how to do logging; it is done automatically, and the log is always stored in the same place in the target directory.
Easy way to publish pipelines: all code of a pipeline is stored in the mxp subdirectory. Thus, publishing consists of compressing and publishing the contents of this directory. Anyone who has access to the input data may download and re-run this code.
The fact that MXP stores a copy of all scripts in the target directory gives the user the additional ability to re-examine, at any time, how the target was obtained, regardless of whether the code was later changed.