Pipelines: Overview and design issues


This page is still very much under construction ...

A pipelining system, at its most basic, is an environment in which large amounts of data are pushed through a series of processing stages that are linked together by data dependencies. Ideally this whole process takes place in a parallel processing environment, though that is by no means necessary.

The design of most pipelines proceeds along the following steps:

  1. Designation of the source data
  2. Outline of the required processing steps
  3. Design of the pipeline
  4. Establishment of a quarantine environment
  5. First run of pipeline
  6. Pipeline rerun (often many times) as new data comes in
  7. Analysis of data

Designation of the source data

Source data for a new pipeline comes in two main forms: data that already exists before the pipeline is ever designed, and data that is being acquired while the pipeline is being written (or even thereafter).

If the data already exists, this stage tends to be fairly trivial. The main decision to be made is whether one is going to use the entirety of the data or just a subset thereof. If it is only a subset, it often makes sense to create a new directory tree for the source data and to copy or link the original data into it. The other option is to create a file (or set of files) holding the names of all the files one wants to use, and to parse it as input to the pipeline.
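The file-list option can be sketched as follows; a minimal illustration, where the function name, the comment convention, and the paths are all hypothetical:

```python
def read_subject_list(list_file):
    """Return the input file paths listed one per line in list_file,
    skipping blank lines and '#' comment lines."""
    paths = []
    with open(list_file) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                paths.append(line)
    return paths
```

The pipeline driver would then call this once at startup and iterate over the returned paths, so that changing the subset is just a matter of editing the list file.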

If the data is being acquired during the lifetime of the pipeline, one also has to consider converting the scans from the scanner's native format to the MINC file format. This step is usually not part of the main pipeline, but is instead run separately as the data arrives. It makes sense to have a preexisting directory structure into which the data can be imported.
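A minimal sketch of such an import step, assuming a subject/visit directory layout (the layout, the function name, and the argument paths are illustrative; dcm2mnc is the MINC DICOM converter, but its exact invocation should be checked against your installation):

```python
import os

def import_scan(native_file, subject_id, visit, data_root="/data/study"):
    """Create the target directory for one incoming scan and return it,
    together with the converter command that would populate it.

    The command is returned rather than executed so the layout can be
    tested without the converter installed.
    """
    out_dir = os.path.join(data_root, subject_id, visit, "native")
    os.makedirs(out_dir, exist_ok=True)
    # Hypothetical invocation: convert the native scan into out_dir.
    cmd = ["dcm2mnc", native_file, out_dir]
    return out_dir, cmd
```

Keeping the conversion separate from the main pipeline means new scans can be imported as they arrive, and the pipeline simply picks them up on its next run.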

Outline of the required processing steps

After designating the source data to be used by the pipeline, it is time to lay out the conceptual framework behind it. What this effectively involves is a series of flowcharts and diagrams which illustrate the data flow and the desired results/outputs. This is also where the data dependencies between the processing stages are established.
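The data dependencies laid out in those flowcharts amount to a directed graph of stages, and a valid execution order is a topological sort of that graph. A minimal sketch, with hypothetical stage names:

```python
def stage_order(deps):
    """Return pipeline stages in an order that respects dependencies.

    deps maps each stage name to the list of stages whose outputs
    it consumes; a cycle in the graph is an error.
    """
    order, done = [], set()

    def visit(stage, seen=()):
        if stage in done:
            return
        if stage in seen:
            raise ValueError("cyclic dependency involving %s" % stage)
        for parent in deps.get(stage, ()):
            visit(parent, seen + (stage,))
        done.add(stage)
        order.append(stage)

    for stage in deps:
        visit(stage)
    return order
```

A scheduler built on this idea can also rerun only the stages downstream of whatever data changed, which becomes relevant when the pipeline is rerun as new data comes in.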

Design of the pipeline

Establishment of a quarantine environment

First run of pipeline

Pipeline rerun (often many times) as new data comes in

Analysis of data