A pipeline is a sequence of operations that leads to a required result.
What exactly “result” means, and what kind of “operations” are used, depends heavily on the application domain. The expected application domain in turn influences the design of a tool for building pipelines.
The famous Unix utility make (known since 1976) was probably the first tool for building pipelines, although the word “pipeline” is rarely used in conjunction with make. Virtually all tools for building pipelines borrow from make, and MXP is no exception. What distinguishes these tools are the elementary units that the pipeline operates on and the rules used to decide whether to re-execute a step or reuse its existing results. These differences in turn shape the language used to describe the pipelines (e.g., Makefile syntax and semantics).
The units that MXP operates on are called targets, just as in make. A target is represented by a directory containing an arbitrary set of files (and possibly subdirectories). We often use the word “target” instead of the more exact term “target directory”.
As with make, running MXP consists of obtaining the target specified on the command line. Obtaining a target may require other targets. MXP checks whether the required targets have already been obtained and whether they are up-to-date; if not, MXP automatically rebuilds them, which may in turn require further targets, so obtaining the required targets is a recursive process. Which targets a given target requires, and how it should be obtained from them, is specified in the Makefile (again, the term is borrowed from make).
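The recursive process described above can be sketched in a few lines of Bash. This is only an illustration of the idea, not MXP's implementation: the DEPS table, the obtain function, and the DONE and BUILT variables are hypothetical names, and the up-to-date check is reduced to "already handled in this run".

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not MXP code. All names here are hypothetical.

# Which targets each target requires (space-separated list).
declare -A DEPS=(
  [d00_idata]=""
  [d01_pdata]="d00_idata"
)

declare -A DONE=()   # targets already handled in this run
BUILT=()             # order in which targets were obtained

obtain() {
  local target=$1 dep
  # Skip targets that were already obtained during this run.
  if [[ -n ${DONE[$target]:-} ]]; then
    return 0
  fi
  # First obtain everything this target requires (the recursive step).
  for dep in ${DEPS[$target]:-}; do
    obtain "$dep"
  done
  # A real tool would run the method script for the target here.
  BUILT+=("$target")
  DONE[$target]=1
}

obtain d01_pdata
echo "${BUILT[*]}"   # d00_idata d01_pdata
```

Requesting d01_pdata causes d00_idata to be obtained first, because the required targets are always visited before the target that needs them.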
What a Makefile is and how to use it
A Makefile consists of rules. In MXP, the Makefile is a Bash script. Here is an example of a rule:
MXP_MAKEFILE[d01_pdata]="(idata_DIR = d00_idata) pdata_0 : pdata"
This rule states that:
- target d01_pdata requires target d00_idata
- method pdata with parameters pdata_0 should be used to obtain target d01_pdata from target d00_idata
- during execution of method pdata, the environment variable idata_DIR will be set to the full path to target directory d00_idata
Also, it implicitly states that:
- there is an analysis directory (the current directory or a directory explicitly specified in the MXP command-line arguments) that contains a subdirectory mxp, and a file Makefile.sh inside of it
- the target directory named d01_pdata will be created within the analysis directory as a result of obtaining target d01_pdata (or, if this directory already exists, MXP will check whether it is up-to-date and rebuild it if it is not)
- there is a file pdata.sh containing a Bash script that will be executed in order to obtain target d01_pdata
- there is a file pdata_0.params.sh containing a Bash script (that defines parameters) that will be executed in order to obtain target d01_pdata
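To make the reading of the rule concrete, the following sketch splits the example rule string into its parts and derives the script file names from them. It is a minimal illustration under assumptions: the regular expression and the variable names env_var, required, params and method are hypothetical, and MXP's real parser may accept more syntax than this single shape.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; MXP's actual rule parser is not shown in this text.

declare -A MXP_MAKEFILE
MXP_MAKEFILE[d01_pdata]="(idata_DIR = d00_idata) pdata_0 : pdata"

target=d01_pdata
rule=${MXP_MAKEFILE[$target]}

# Assumed shape of a rule: "(VAR = REQUIRED_TARGET) PARAMS : METHOD"
rule_re='^\(([A-Za-z_][A-Za-z0-9_]*) *= *([^ )]+)\) *([^ :]+) *: *([^ ]+)$'

if [[ $rule =~ $rule_re ]]; then
  env_var=${BASH_REMATCH[1]}    # idata_DIR
  required=${BASH_REMATCH[2]}   # d00_idata
  params=${BASH_REMATCH[3]}     # pdata_0
  method=${BASH_REMATCH[4]}     # pdata
  echo "target:          $target"
  echo "required target: $required"
  echo "method script:   $method.sh"
  echo "params script:   $params.params.sh"
  echo "environment:     $env_var=<full path to $required>"
else
  echo "rule does not match the assumed shape" >&2
fi
```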
Strictly speaking, there is no difference between a parameter script and a method script. MXP introduces this distinction to encourage pipeline developers to clearly separate parameters from methods. Parameters may be changed by the pipeline user (for example, the user may want to use their own parameters for quality control), while methods are much more stable and are not expected to change from one pipeline application to another.
To determine if the target is up-to-date, MXP will check if:
- the target directory exists
- the last attempt to build the target completed successfully
- all required targets are up-to-date
- the rule used to obtain the target has not been updated
- the method and parameter scripts used to obtain the target have not been updated
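The checks listed above could be approximated as follows. This is a sketch under stated assumptions, not a description of MXP internals: the bookkeeping files .ok and .rule and the function is_up_to_date are hypothetical, the text does not say how MXP records a successful build or detects a changed rule, and the recursive check of required targets is left out for brevity.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. The bookkeeping files (.ok, .rule) are hypothetical.

# Usage: is_up_to_date TARGET_DIR RULE_TEXT [SCRIPT...]
is_up_to_date() {
  local dir=$1 rule=$2 script
  shift 2
  [[ -d $dir ]] || return 1                      # target directory exists
  [[ -f $dir/.ok ]] || return 1                  # last build succeeded
  [[ -f $dir/.rule ]] || return 1                # rule was recorded at build time
  [[ $(<"$dir/.rule") == "$rule" ]] || return 1  # rule has not changed since
  for script in "$@"; do
    # A script modified after the successful build invalidates the target.
    if [[ $script -nt $dir/.ok ]]; then
      return 1
    fi
  done
  # A full check would also apply the same test to every required target.
  return 0
}

# Small demonstration in a scratch directory.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/d01_pdata"
touch "$work/pdata.sh" "$work/pdata_0.params.sh"
rule="(idata_DIR = d00_idata) pdata_0 : pdata"

# Pretend a build has just succeeded.
printf '%s\n' "$rule" > "$work/d01_pdata/.rule"
touch "$work/d01_pdata/.ok"

if is_up_to_date "$work/d01_pdata" "$rule" "$work/pdata.sh" "$work/pdata_0.params.sh"; then
  echo "up-to-date"
else
  echo "needs rebuild"
fi
```

Comparing modification times is only one possible way to detect an updated script; a tool could equally store checksums.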
Chaining pipelines
An important feature of MXP is that new pipelines can be created by reusing pieces of existing pipelines. Each pipeline has a parent; only the root pipeline (which is part of the MXP base) has no parent. The Makefile, methods and parameter sets defined in the parent pipeline are available in the child pipeline, and the child pipeline may override exactly those pieces of the parent pipeline that need to be changed. How this works is described in detail in the section “Chaining Pipelines”.
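The idea of overriding a single piece can be illustrated with plain Bash. This sketch only shows that a later assignment to the same rule replaces the earlier one. How MXP actually locates and loads a parent pipeline is not reproduced here; the functions parent_rules and child_rules and the parameter set pdata_custom are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the mechanism MXP uses to load a parent pipeline.

declare -A MXP_MAKEFILE

# Rules as a parent pipeline might define them.
parent_rules() {
  MXP_MAKEFILE[d01_pdata]="(idata_DIR = d00_idata) pdata_0 : pdata"
}

# A child pipeline changes only the piece it needs: here a hypothetical
# parameter set "pdata_custom" replaces "pdata_0"; the method is unchanged.
child_rules() {
  MXP_MAKEFILE[d01_pdata]="(idata_DIR = d00_idata) pdata_custom : pdata"
}

parent_rules   # parent definitions first
child_rules    # child definitions afterwards, so its assignments win

echo "${MXP_MAKEFILE[d01_pdata]}"
```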
Logging
Another important feature of MXP is logging. When a target is built, a full log is written to the target directory. This log can be examined later to learn exactly how the target was built (if the build succeeded) or to find out why the build failed (if it did not).
It is also possible to save a log of a full MXP run, which may involve building multiple targets.
Useful trick
Looking at the example Makefile rule above, one may wonder why the targets are named d00_idata and d01_pdata rather than simply idata and pdata.
The answer is simple. Targets are represented by directories, and you will often use the ls command (or ls -l) to see which targets you already have. The ls command lists directories in alphabetical order, so prefixing directory names with a number ensures a convenient ordering.
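The effect of the prefix is easy to see in a scratch directory. In this small sketch the third target, analysis, is a hypothetical addition used only to make the difference visible; LC_ALL=C is set so that the sort order does not depend on the user's locale.

```shell
#!/usr/bin/env bash
# Illustrative sketch; "analysis" is a hypothetical third target.

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

mkdir "$work/plain" "$work/prefixed"
mkdir "$work/plain/idata" "$work/plain/pdata" "$work/plain/analysis"
mkdir "$work/prefixed/d00_idata" "$work/prefixed/d01_pdata" "$work/prefixed/d02_analysis"

# Join the one-name-per-line output of ls into a single line.
plain=$(LC_ALL=C ls "$work/plain" | paste -sd' ' -)
prefixed=$(LC_ALL=C ls "$work/prefixed" | paste -sd' ' -)

echo "without prefix: $plain"      # analysis idata pdata
echo "with prefix:    $prefixed"   # d00_idata d01_pdata d02_analysis
```

Without the prefix the last step of the pipeline is listed first; with the prefix the listing follows the order in which the targets are built.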