RPPL: Brain Imaging Centre Pipelining System


RPPL (Run Pipeline in Parallel) is a pipelining system designed to run a series of analyses/transformations on a large data-set. As such it lets you specify the individual stages in the pipeline and create dependencies between these stages. What follows is a brief tutorial describing how to set up and run such a pipeline while hopefully avoiding some of the pitfalls associated with this system.

 Just to avoid some confusion, rppl is also often referred to as PCS, for Production (or Process) Control System. The two terms can be used interchangeably, though I have chosen to stick with rppl for the simple reason that that is the name of the programme one actually calls.

 To give credit where credit is due, the pipelining system described below was written and designed by Alberto Jimenez and Alex Zijdenbos.
 
 

Overview

rppl is based on the perl programming language, and knowing (or at least having a rough understanding of) perl can almost be considered a prerequisite for using rppl. Each stage of a pipeline consists of the name of the programme to run, a stage name, options to be passed to that programme, and the files needed by that programme. These separate elements are combined to build the command itself, which is submitted via the batch queueing system. Furthermore, the status of all running pipelines is tracked in an msql database, allowing the user to keep track of failures in a pipeline and to rerun just those failures.

The pipeline configuration file is usually composed of the following parts:

  1. Parsing the input files
  2. Setting the variables to be used in the programmes. This again has two main sub-stages:
    1. Setting all of the filenames to be used by the programmes that rppl calls. If, for example, the pipeline calls mritotal followed by mincresample, you'll want to set the filename of the xfm file which will result from the call to mritotal here, as well as the name of the mincfile that mincresample will produce.
    2. The other variables one can set here are the options with which the batch queueing system will be called.
  3. Defining each of the stages in the pipeline itself. I.e., these are the calls to the programmes which are to be run against the input files.

Just so that some of the subsequent explanations make sense, I'm going to give a quick summary of how to invoke rppl. Please see the sections "Creating the environment" and "Running rppl" for more details.

 Here is an example call to rppl:

rppl -c -p ../src/tms/asymetry.cfg -f mni_icbm_?????_t1*
... and a quick rundown of the above invocation:
 
 
rppl  The programme name
-c  Create a new pipeline. The other commonly used options are -r, which reruns an already existing pipeline, and -cb, which clobbers an existing pipeline and then recreates it
-p filename  The name of the pipeline configuration file to run. By convention these files end in .cfg
-f file pattern  The -f option specifies which input files to use. Strictly speaking, -f takes one or more strings, which are then parsed to create the dataset ID variable. A common way of doing this is to specify a file pattern, let the shell take care of globbing that pattern, and then parse the resulting filenames for the ID in the configuration file.
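
For instance, the file pattern in the example call above might be expanded by the shell into something like the following (the subject numbers here are hypothetical):

mni_icbm_00100_t1_final.mnc mni_icbm_00101_t1_final.mnc mni_icbm_00102_t1_final.mnc

Each of these strings is then handed to the configuration file, which extracts the five-digit subject number from each one (see "Parsing the input files" below).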

Creating the environment

There are several environment variables that have to be set before one can begin using rppl. The necessary variables are tabulated below:
 
 
Name  Description
MSQL_HOME  The path to the installed directory of msql.
PCS_HOME  The home directory of rppl
RPPL_INCLUDES  I have no idea what this is for
PATH  Has to include the PCS/bin and PCS/tools directories
PERL5LIB  Has to include the PCS/perllib directory
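
As a concrete illustration, here is how one might set these variables in an sh-style shell; the installation paths are hypothetical and will differ from site to site (csh users would use setenv instead of export):

# hypothetical installation paths -- adjust for your site
export MSQL_HOME=/usr/local/msql
export PCS_HOME=/usr/local/pcs
export RPPL_INCLUDES=${PCS_HOME}/include   # value is a guess; see the caveat above
export PATH=${PCS_HOME}/bin:${PCS_HOME}/tools:${PATH}
export PERL5LIB=${PCS_HOME}/perllib:${PERL5LIB}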

rppl syntax

rppl parses the pipeline configuration file (specified by -p) to determine its command set. Each command in the configuration file begins with -- (two dashes) followed by the particular command and its options, e.g. --defvar $foo = "bar";. Just as in perl, each line must be terminated with a semi-colon. Again as in perl, comments are preceded by the number sign (#).
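
To make this concrete, here is a small hypothetical fragment of a configuration file:

# comments begin with a number sign, just as in perl
--defvar $subject = "icbm";            # every command starts with -- and ends with ;
--defvar $prefix = "mni_${subject}";   # the body of a command is ordinary perl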

Setting the variables

This is the part of the pipeline configuration where all of the filenames to be used are set, where the input files are parsed, and where the variables for running batch are initialised.
Parsing the input files
The one variable that is essential for the operation of rppl is $dsid, which stands for Data-Set IDentification. This variable thus has to be unique for each of the files that are being run through a particular pipeline. The most common way to set the $dsid variable is by parsing the input filenames (specified by the -f option) and extracting the unique identifier from them. An example of doing just that is given below:
 
 
--user_exp $ID = ($file_name =~ s/.+mni_icbm_(\d\d\d\d\d)_t1_final\.mnc/$1/);
--dsid  sprintf("%s", $file_name);
Before we go any deeper into analysing the statement above, it might be useful to provide a quick overview of all of the rppl commands applicable to this part of constructing a pipeline:
 
 
Command Description
--defvar  Usually used to operate on a variable. For example, to initialise a variable you would type --defvar $foo = "bar"; Any perl code is legal here, so variables can be modified with regular expressions, concatenations, etc.
--user_exp  The user expression evaluated against each input string supplied by the -f option to rppl. If the -f option results in multiple strings/files, the --user_exp expression is applied to the one currently being evaluated.
--dsid  The dataset ID for this particular instance of the input filenames. The value set here can later be referred to through the $dsid variable. The most common way to set it is via perl's print-to-string function, sprintf.
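
To illustrate the point about --defvar accepting arbitrary perl, here is a small hypothetical example of building up and then modifying a variable (the names are made up):

--defvar $base = "mni_icbm";
--defvar $t1Name = "${base}_${dsid}_t1";   # concatenation through interpolation
--defvar $t1Name =~ s/t1/t2/;              # modification with a regular expression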

The example above

--user_exp $ID = ($file_name =~ s/.+mni_icbm_(\d\d\d\d\d)_t1_final\.mnc/$1/);
--dsid  sprintf("%s", $file_name);


This takes the input filename (via --user_exp), substitutes away everything except the five digits in the middle of the filename, and then assigns that five-digit string to the dataset ID via --dsid.
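
For example, given the hypothetical input filename mni_icbm_00100_t1_final.mnc, the substitution reduces $file_name to the string 00100, and the dataset ID $dsid is therefore set to 00100.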
 
 

Setting the filenames and other variables to be used
The next step in creating most pipelines is to set all of the names of the files that will be created during the pipeline process. This is accomplished through the following syntax:
 
 
--defvar $variable = "value";
The best way to create an effective file and directory structure is by breaking it down into its smallest component parts. In other words, start by creating a variable for the base directory, then one for each of the subdirectories (including the basedir as its first component), and then one for each of the files. On a similar note, it is good practice to check that the directories and necessary files actually exist, and, where applicable, to create them if they don't. The following little code snippet illustrates how one might set up these variables:
 
 
# output directories
--defvar $baseDir = "/data/rome/jason/pitt/pipeline/${dsid}";
--defvar $transformsDir = "${baseDir}/transforms";
--defvar $linearDir = "${transformsDir}/linear";
--defvar $finalDir = "${baseDir}/final";
--defvar $classifyDir = "${baseDir}/classify";

--parse system("mkdir -p $baseDir") if (! -d $baseDir);
--parse system("mkdir -p $transformsDir") if (! -d $transformsDir);
--parse system("mkdir -p $linearDir") if (! -d $linearDir);
--parse system("mkdir -p $finalDir") if (! -d $finalDir);
--parse system("mkdir -p $classifyDir") if (! -d $classifyDir);

# files
--defvar $nativeT1 = "/data/rome/jason/pitt/native/${dsid}.mnc";
--defvar $talXfm = "${linearDir}/${dsid}_tal.xfm";
--defvar $talMnc = "${finalDir}/${dsid}_tal.mnc";
--defvar $classify = "${classifyDir}/${dsid}_cls.mnc";
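
The snippet above only checks (and creates) the output directories. It is equally good practice to verify that a necessary input file exists before the pipeline is built; assuming --parse accepts arbitrary perl in the same way the mkdir lines above do, a hypothetical check might read:

--parse die "missing input file: $nativeT1" if (! -e $nativeT1);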
Setting the default batch options
rppl submits all of its jobs through the batch queueing system. One can therefore also control the options with which the batch jobs are submitted. This is done through two rppl commands: --batch and --batchqueue. Below is an example usage of these two commands:
--defvar         $BatchDefOpt = sprintf ("-J %s:%s -m cr", $dsid, $stage_name);
--defvar         $BatchLogOpt = sprintf ("-k -o %s/%s_%s.log", $LogDir, $dsid, $stage_name);
--batch         "$BatchDefOpt $BatchLogOpt";
--batchqueue    "medium";
Please see the batch man pages for more details on what options batch accepts.
 
 

Building a pipeline stage

Once all of the variables are set, it is time to define the actual pipeline stages. Here are the relevant rppl commands for this part of building a pipeline:
 
 
Command  Description
--stageName  The name by which rppl will refer to this stage. For example, --stageName "mritotal";
--program  The actual programme that will be called here. If this programme is not in your $PATH (as defined at the time that rppl is actually invoked), then you need to specify the full path here. Also note that --program refers to the programme only, not to any options - those will be described later. E.g. --program "mritotal";
--infiles  The input files to the programme. Conventionally, the --infiles tag is handled by creating a new variable and assigning it the input file, e.g. --infiles $source = $somefile;. Furthermore, it is common to specify only one file per --infiles tag, as the tag can be repeated multiple times within each pipeline stage.
--outfiles  Same as --infiles, except this applies for the files that will be output from the programme.
--files  This is where one combines the input and output files in the way that they will actually be passed to the programme. E.g. --files "$source $dest";
--options  The options which are to be passed to the programme. For example, if this stage is running gzip, you might want to add --options "-f";
--post_actions  Any actions to perform after the step has completed. --post_actions is most commonly used to compress the output files of any particular stage
--prqst  The prerequisites that must be satisfied before this stage can be run. It is the --prqst option which gives rppl its power, allowing it to take advantage of the coarse parallelism provided by the batch queueing system. The syntax for --prqst is a little different, however. Here's an example line: --prqst ${dsid}: step_mritotal SUCCEED and step_mincresample_t1 SUCCEED. The first part of the expression, ${dsid}, provides the context for rppl to interpret the subsequent step definitions: the named steps have to have succeeded for this dataset ID. In almost all cases this will be the dsid of the current file being processed. After that come the names of the steps (taken from their --stageName values) preceded by "step_", together with the condition that is to be met (currently only SUCCEED is supported). More than one criterion can be required with the "and" keyword. Also note that this is the only part of the pipeline definition which does not require a semi-colon at the end.

The condensed overview above might leave one a bit out of breath, so I will try to illustrate how to create a pipeline step with an example. Take the following code:
 
 

--stageName "brain_mask";
--program   "/usr/people/alex/AI_ALQUA/bin/msd_masks";
--infiles   $source = $Cls;
--infiles   $surface = $symSurface;
--outfiles  $mask = $symMask;
--outfiles  $dest = $clsMasked;
--files     "$source $surface $mask";
--options   "-clobber -masked_input $dest";
--post_actions "gzip -f $dest";
--prqst     ${dsid}: step_classify SUCCEED and step_transform_mask SUCCEED
The first part of each stage is its name. This value is important in specifying the relationship between the various stages of a pipeline, as well as in forcing the pipeline to run only from a given step onwards. For example, if the following step in the pipeline is dependent on the success of the brain_mask step, it would contain the line --prqst ${dsid}: step_brain_mask SUCCEED.

After defining the name of the current stage, one needs to specify the programme that will actually be run. Only the programme name is needed here, without any options. The full path to the programme is only necessary if that programme cannot be found through the $PATH environment variable.

 Once the name and programme are defined, the input and output files have to be specified. The best way of accomplishing this is by creating local variables, which in turn refer to the filenames/variables created in the first part of the pipeline. That, in fact, is what is done in the little example above ... though there is nothing stopping anyone from simply providing the filenames as a string there and then.

The --files tag then combines the input and output files in a way that the programme being called will understand. One thus has to know a little bit about the programme being called, though normally the infiles are all specified before the outfiles. Lastly, --options are the options passed on to the programme, and the final format of any programme call assembled by rppl is (program name) (options) (files).
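
To illustrate that ordering with the brain_mask stage above: rppl would assemble a call of roughly the following form, with each variable expanded to the filename it was assigned:

/usr/people/alex/AI_ALQUA/bin/msd_masks -clobber -masked_input $dest $source $surface $mask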

 The --post_actions tag defines what will be done after the stage has finished running, and is therefore most commonly used for cleanup and compression tasks. The --prqst tag defines the relationship between stages, and is interpreted along the lines of: "run this stage when the following stages have completed successfully."

Running rppl

The basics of invoking rppl are fairly trivial: one need only provide the pipeline name, whether to create a new pipeline or rerun an old one, and which ID values to pass to the pipeline.

There are three options for running a pipeline: creating a new pipeline, rerunning an already existing pipeline, and clobbering and then recreating an already existing pipeline. These options are, respectively, -c, -r, and -cb. rppl tends to be good about intelligently rerunning pipelines, so if you have an existing pipeline but have added a few stages to it, it will run just those new stages when -r is specified. If, however, there were a few minor changes made to already existing stages, one has two options: rerunning the entire pipeline with the clobber option (-cb), or using the following command:
 
 

rppl -r -p pipe -f dsids -exec -from step_name
This will re-execute the pipeline from the specified stage name onwards. The only problem here is that all of the programmes which will be called in that process have to implement a -clobber option, else the pipeline will fail.

Note that there is a bug when trying to run rppl in the background; see the workaround described under "The Input Bug" below.

An example pipeline

This example pipeline will hopefully bring together the various disparate elements discussed above. The goal of this pipeline is quite simple: to test the quality of the registration procedure employed here at the MNI. Explanations of how the pipeline works are embedded in the code itself.
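
The original configuration file is not reproduced on this page; what follows is a minimal sketch assembled from the fragments discussed above, registering each native scan with mritotal and then resampling it into stereotaxic space with mincresample. It is not the original quality-testing configuration, and all paths, patterns, and stage names are illustrative.

# parse the input filenames, using the five-digit subject number as the dataset ID
--user_exp $ID = ($file_name =~ s/.+mni_icbm_(\d\d\d\d\d)_t1_final\.mnc/$1/);
--dsid  sprintf("%s", $file_name);

# output directories
--defvar $baseDir = "/data/rome/jason/pitt/pipeline/${dsid}";
--defvar $transformsDir = "${baseDir}/transforms";
--defvar $finalDir = "${baseDir}/final";
--defvar $logDir = "${baseDir}/logs";
--parse system("mkdir -p $transformsDir") if (! -d $transformsDir);
--parse system("mkdir -p $finalDir") if (! -d $finalDir);
--parse system("mkdir -p $logDir") if (! -d $logDir);

# files
--defvar $nativeT1 = "/data/rome/jason/pitt/native/${dsid}.mnc";
--defvar $talXfm = "${transformsDir}/${dsid}_tal.xfm";
--defvar $talMnc = "${finalDir}/${dsid}_tal.mnc";

# default batch options
--defvar $BatchDefOpt = sprintf("-J %s:%s -m cr", $dsid, $stage_name);
--defvar $BatchLogOpt = sprintf("-k -o %s/%s_%s.log", $logDir, $dsid, $stage_name);
--batch "$BatchDefOpt $BatchLogOpt";
--batchqueue "medium";

# stage 1: register the native scan to stereotaxic space
--stageName "mritotal";
--program   "mritotal";
--infiles   $source = $nativeT1;
--outfiles  $xfm = $talXfm;
--files     "$source $xfm";
--options   "-clobber";

# stage 2: resample the native scan with the resulting transform;
# one would normally also pass a -like template to define the sampling
--stageName "resample_t1";
--program   "mincresample";
--infiles   $source = $nativeT1;
--infiles   $xfm = $talXfm;
--outfiles  $dest = $talMnc;
--files     "$source $dest";
--options   "-clobber -transformation $xfm";
--post_actions "gzip -f $dest";
--prqst     ${dsid}: step_mritotal SUCCEED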


Pitfalls and Tricks

Under construction ... but the following subsections address some of the things (more will surely be added) which I want to cover here.
 
 

Using the -clobber option

Crashes happen ... whether they are related to rppl or to a power failure, there will be times when your pipeline is thrown out of whack for one reason or another. rppl is quite good at allowing you to deal with such situations, namely through the -exec option and the -iof option (which decides what stages to run based on the existence and timestamps of the input and output files). These options, however, are still dependent on good pipeline design, mainly because the programmes called by your pipeline might fail if an output file already exists. It is for that reason that you should call your programmes with the -clobber option whenever possible.
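
As an example, to rerun an existing pipeline and let rppl itself decide which stages need re-running from the files on disk, an invocation along the following lines should work (assuming -iof is passed just like the other flags shown earlier):

rppl -r -iof -p pipe.cfg -f dsids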
 
 

The Input Bug

There is a small bug in rppl which occasionally (and only occasionally) stops it from executing correctly when it is backgroundified. What appears to happen is that rppl waits for input, even though it has no good reason to do so. This bug is easily circumvented, however, by redirecting /dev/null to rppl as input. In order to run a backgroundified process, you thus have to invoke rppl with the following syntax:
 
 
rppl [options] < /dev/null >& /data/somefile.log &

This page was last updated on: March 31, 2000

Please send any questions or comments to jason@bic.mni.mcgill.ca