RPPL: Brain Imaging Centre Pipelining System
Just to avoid some confusion, rppl is also often referred to as PCS, for Production (or Process) Control System. The two terms can be used interchangeably, though I have chosen to stick with rppl for the simple reason that that is the name of the programme one actually calls.
To give credit where credit is due, the pipelining system described
below was written and designed by Alberto Jimenez and Alex Zijdenbos.
The pipeline itself is usually composed of the following stages:
Here is an example call to rppl:
Option | Description |
---|---|
rppl | The programme name |
-c | Create a new pipeline. The other commonly used options are -r, which reruns an already existing pipeline, and -cb, which clobbers an existing pipeline and then recreates it |
-p filename | The name of the pipeline configuration file to run. By convention these files end in .cfg |
-f file pattern | The -f option specifies which input files to use. Strictly speaking, that isn't quite correct: -f takes one or more strings, which are then parsed to create the dataset ID variable. A common way of doing this is to specify a file pattern, let the shell take care of globbing that pattern, and then parse the resulting filenames for the ID in the configuration file. |
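Since the shell expands the pattern before rppl ever sees it, a quick way to check what -f will receive is to prefix the call with echo. Here is a minimal sketch; the scratch directory, the filenames, and mypipe.cfg are all purely illustrative:

```shell
# Create some scratch files matching the naming scheme used later in
# this document (purely illustrative).
mkdir -p /tmp/rppl_demo
touch /tmp/rppl_demo/mni_icbm_00100_t1_final.mnc
touch /tmp/rppl_demo/mni_icbm_00101_t1_final.mnc

# echo stands in for the real call: the shell expands the pattern,
# so rppl receives one string per matching file.
echo rppl -c -p mypipe.cfg -f /tmp/rppl_demo/mni_icbm_*_t1_final.mnc
```

Dropping the echo then runs the pipeline against exactly the files the echo printed.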
rppl relies on the following environment variables:

Name | Description |
---|---|
MSQL_HOME | The path to the installation directory of msql. |
PCS_HOME | The home directory of rppl. |
RPPL_INCLUDES | I have no idea what this is for. |
PATH | Has to include the PCS/bin and PCS/tools directories. |
PERL5LIB | Has to include the PCS/perllib directory. |
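Putting the table together, a typical environment setup might look like the following; the installation paths are hypothetical, so substitute your site's actual locations:

```shell
# Hypothetical installation paths -- adjust for your site.
export MSQL_HOME=/usr/local/msql
export PCS_HOME=/usr/local/pcs

# rppl's scripts and helper tools must be on the search path ...
export PATH=$PCS_HOME/bin:$PCS_HOME/tools:$PATH

# ... and perl must be able to find the PCS modules.
export PERL5LIB=$PCS_HOME/perllib${PERL5LIB:+:$PERL5LIB}
```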
```
--user_exp $ID = ($file_name =~ s/.+mni_icbm_(\d\d\d\d\d)_t1_final.mnc/$1/) --dsid sprintf("%s", $file_name);
```
Command | Description |
---|---|
--defvar | Usually used to operate on a variable. For example, to initialise a variable you would type --defvar $foo = "bar"; Any perl code is legal here, so variables can be modified with regular expressions, concatenations, etc. |
--user_exp | The initial user expression supplied by the -f option to rppl. If the -f option results in multiple commands/files, --user_exp refers to the one currently being evaluated. |
--dsid | The dataset ID for this particular instance of the input filenames. The value set here can later be referred to through the $dsid variable. The most common way to set it is with the string-formatting function sprintf. |
The example above

```
--user_exp $ID = ($file_name =~ s/.+mni_icbm_(\d\d\d\d\d)_t1_final.mnc/$1/) --dsid sprintf("%s", $file_name);
```

thus takes the input filename (--user_exp), substitutes away everything but the five digits in the middle of the name, and assigns those five digits to the dataset ID (--dsid).
```
# output directories
--defvar $baseDir = "/data/rome/jason/pitt/pipeline/${dsid}";
--defvar $transformsDir = "${baseDir}/transforms";
--defvar $linearDir = "${transformsDir}/linear";
--defvar $finalDir = "${baseDir}/final";
--defvar $classifyDir = "${baseDir}/classify";

--parse system("mkdir -p $baseDir") if (! -d $baseDir);
--parse system("mkdir -p $transformsDir") if (! -d $transformsDir);
--parse system("mkdir -p $linearDir") if (! -d $linearDir);
--parse system("mkdir -p $finalDir") if (! -d $finalDir);
--parse system("mkdir -p $classifyDir") if (! -d $classifyDir);

# files
--defvar $nativeT1 = "/data/rome/jason/pitt/native/${dsid}.mnc";
--defvar $talXfm = "${linearDir}/${dsid}_tal.xfm";
--defvar $talMnc = "${finalDir}/${dsid}_tal.mnc";
--defvar $classify = "${classifyDir}/${dsid}_cls.mnc";
```
```
--defvar $BatchDefOpt = sprintf ("-J %s:%s -m cr", $dsid, $stage_name);
--defvar $BatchLogOpt = sprintf ("-k -o %s/%s_%s.log", $LogDir, $dsid, $stage_name);
--batch "$BatchDefOpt $BatchLogOpt";
--batchqueue "medium";
```
Command | Description |
---|---|
--stageName | The name by which rppl will refer to this stage. For example, --stageName "mritotal"; |
--program | The actual programme that will be called. If the programme is not part of your $PATH (as defined at the time rppl is actually invoked), then you need to specify its full path here. Also note that --program refers to the programme only, not to any options; those are described later. E.g. --program "mritotal"; |
--infiles | The input files to the programme. Conventionally, the --infiles tag is handled by creating a new variable and assigning it the input file, e.g. --infiles $source = $somefile;. Furthermore, it is common to specify only one file per --infiles tag, since the tag can be repeated multiple times within each pipeline stage. |
--outfiles | Same as --infiles, except it applies to the files that the programme will output. |
--files | This is where the input and output files are combined in the form that will actually be passed to the programme, e.g. --files "$source $dest"; |
--options | The options to be passed to the programme. For example, if this stage runs gzip, you might want to add --options "-f"; |
--post_actions | Any actions to perform after the stage has completed. --post_actions is most commonly used to compress the output files of a stage |
--prqst | The prerequisites that must be satisfied before this stage can run. It is the --prqst option that gives rppl its power, allowing it to take advantage of the coarse parallelism provided by the batch queueing system. The syntax of --prqst is a little different from the other tags. Here's an example line: --prqst ${dsid}: step_mincresample_t1 SUCCEED and step_mincresample_t2 SUCCEED. The first part of the expression, ${dsid}, provides the context in which rppl interprets the subsequent step definitions: the listed steps have to have succeeded for this dataset ID. In almost all cases this will be the dsid of the file currently being processed. After that come the names of the steps (taken from their --stageName variable) prefixed with "step_", and the condition to be met (currently only SUCCEED is supported). More than one criterion can be required with the "and" keyword. Also note that this is the only part of the pipeline definition which does not take a semi-colon at the end. |
The condensed overview above might leave one a bit out of
breath, so I will try to illustrate the creation of a pipeline stage
with an example. Take the following code:
```
--stageName "brain_mask";
--program "/usr/people/alex/AI_ALQUA/bin/msd_masks";
--infiles $source = $Cls;
--infiles $surface = $symSurface;
--outfiles $mask = $symMask;
--outfiles $dest = $clsMasked;
--files "$source $surface $mask";
--options "-clobber -masked_input $dest";
--post_actions "gzip -f $dest";
--prqst ${dsid}: step_classify SUCCEED and step_transform_mask SUCCEED
```
After defining the name of the current stage, one needs to specify the programme that will actually be run. Only the programme name is needed here, not its options. The full path to the programme is necessary only if it does not live in the $PATH environment variable.
Once the name and programme are defined, the input and output files have to be specified. The best way of accomplishing this is to create local variables which in turn refer to the filenames/variables created in the first part of the pipeline. That, in fact, is what is done in the little example above, though there is nothing stopping anyone from simply providing the filenames as strings there and then.
The --files tag then combines the input and output files in a way that the programme being called will understand. One thus has to know a little about the programme being called, though normally the input files are all specified before the output files. Lastly, --options are the options passed on to the programme; the final form of any programme call issued by rppl is (program name) (options) (files).
The --post_actions tag defines what will be done after the stage has finished running, and is therefore most commonly used for cleanup and compression tasks. The --prqst tag defines the relationship between stages, and is interpreted along the lines of: "run this stage when the following stages have completed successfully."
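To tie these pieces together, here is a second, deliberately minimal stage sketch that merely compresses a file produced earlier. The stage name "compress_tal" and the prerequisite stage "resample_tal" are invented for illustration; $talMnc is the variable defined in the header section above. Note that, as described earlier, the --prqst line takes no semi-colon:

```
--stageName "compress_tal";
--program "gzip";
--infiles $source = $talMnc;
--files "$source";
--options "-f";
--prqst ${dsid}: step_resample_tal SUCCEED
```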
There are three ways of running a pipeline: creating a new pipeline,
rerunning an already existing pipeline, and clobbering and then rerunning
an already existing pipeline. These correspond, respectively, to the
-c, -r, and -cb options. rppl tends to be good about intelligently rerunning
pipelines, so if you have an existing pipeline but have added a few stages to
it, specifying -r will run just those new stages. If, however, a few minor
changes were made to already existing stages, one has two options:
rerunning the entire pipeline with the clobber option (-cb), or using the
following command:
Note that there is a bug when trying to run rppl in the background. See a workaround here