Creating a Pipeline

Note: This guide assumes some knowledge of PipelineTasks; if you would like, check out Creating a PipelineTask for information on what a PipelineTask is and how to make one. Otherwise, this guide attempts to be mostly standalone and should be readable with minimal references.


PipelineTasks are units of algorithmic code that define what data they need as input, what they will produce as output, and a run method which produces this output. Pipelines are high-level documents that create a specification used to run one or more PipelineTasks. This how-to guide will introduce you to the basic syntax of a Pipeline document and progressively take you through: configuring tasks, verifying configuration, specifying subsets of tasks, creating Pipelines using composition, a basic introduction to options when running Pipelines, and common conventions when creating Pipelines.

A Basic Pipeline

Pipeline documents are written using yaml syntax. If you are unfamiliar with yaml, there are many guides across the internet, but the basic idea is that it is a simple markup language for describing key-value mappings and lists of values (which may themselves be further mappings).

Pipelines have two required keys, description and tasks. The value associated with description is plain text that should give a reader an understanding of what the Pipeline is intended to do.

The second required key is tasks. Unlike description, which has plain text as a value, the value associated with the tasks key is another key-value mapping. This section defines what work this pipeline will do. The keys of the inner mapping are labels which will be used to refer to an individual task. These labels can be any name you choose; the only restriction is that they must be unique amongst all the tasks. The values in this mapping can be a number of things, which you will see through the course of this guide, but the most basic is a string giving the fully qualified PipelineTask that is to be run.

This is a lot of text to digest, so take a look at the following example; a ‘picture’ is worth a thousand words.

description: A demo pipeline in the how-to guide
tasks:
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask

Writing this and saving it to a file with a .yaml extension is all it takes to have a simple pipeline. The description reflects that this Pipeline is intended just for this guide. The tasks section contains only one entry. The label used for this entry, characterizeImage, happens to match the module of the task it points to. It could have been anything, but the name was suitably descriptive, so it was a good choice.

If run, this Pipeline would execute CharacterizeImageTask processing the datasets declared in that task, and write the declared outputs.

Having a pipeline to run a single PipelineTask does not seem very useful. The examples below (and in subsequent sections) are a bit more realistic.

description: A demo pipeline in the how-to guide
tasks:
  isr: lsst.ip.isr.IsrTask
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate: lsst.pipe.tasks.calibrate.CalibrateTask

This Pipeline contains 3 tasks to run, all of which are steps in processing a single frame. The order in which the tasks are executed is not determined by their ordering in the pipeline, but by the definition/configuration of the PipelineTasks. It is the job of the execution system to work out this order, so you may write the tasks in any order in the Pipeline. The following Pipeline is exactly the same from an execution point of view. With that said, be kind to human readers and, if possible, write the tasks in the order they are most often expected to execute so readers can gain some intuition.

description: A demo pipeline in the how-to guide
tasks:
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate: lsst.pipe.tasks.calibrate.CalibrateTask
  isr: lsst.ip.isr.IsrTask

Tasks define their inputs and outputs, which are used to construct an execution graph of the specified tasks. A consequence of this is that if a pipeline does not define all the tasks required to generate all needed inputs and outputs, this will be caught before any execution occurs.

Configuring Tasks

PipelineTasks (and their subtasks) contain a multitude of configuration options that alter the way the task executes. Because Pipelines are designed to do a specific type of processing (per the description field), some tasks may need specific configuration set to enable/disable behavior in the context of that particular Pipeline.

To configure a task associated with a particular label, the value associated with the label must be changed from the qualified task name to a new sub-mapping. This new sub-mapping should have two keys, class and config.

The class key should point to the same qualified task name as before. The value associated with the config keyword is itself a mapping where configuration overrides are declared. The example below shows this behavior in action.

description: A demo pipeline in the how-to guide
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      doVignette: true
      vignetteValue: 0.0
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      astrometry.matcher.maxOffsetPix: 300

This example shows the Pipeline from the previous section with configuration overrides applied to two of the tasks. The label isr is now associated with the keys class and config. The qualified task name is associated with the class keyword instead of the label directly. The config keyword is associated with an additional yaml mapping of configuration Fields and the values appropriate for this Pipeline.

The complete complexity of lsst.pex.config can’t be represented with simple yaml mapping syntax. To account for this, config blocks in Pipelines support two special fields: file and python.

The file key may be associated with either a single value pointing to a filesystem path where a lsst.pex.config file can be found, or a yaml list of such paths. The file paths can contain environment variables that will be expanded prior to loading the file(s). These files will then be applied to the task during configuration time to override any default values.
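
To make this concrete, here is a minimal sketch (the environment variable and file paths are hypothetical) showing the file key used with both a single path and a list of paths:

tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      # a single lsst.pex.config override file; environment variables are expanded
      file: $MY_OBS_PACKAGE_DIR/config/isrOverrides.py
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      # or a yaml list of override files, applied in order
      file:
        - $MY_OBS_PACKAGE_DIR/config/calibrateBase.py
        - $MY_OBS_PACKAGE_DIR/config/calibrateCustom.py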

Sometimes configuration is too complex to express with yaml syntax, yet it is simple enough that it does not warrant its own config file. The python key is designed to support this use case. The value associated with the key is a (possibly multi-line) string with valid python syntax. This string is evaluated and applied during task configuration exactly as if it had been written in a file or typed out in an interpreter. The following example expands the previous one to use the python key.

description: A demo pipeline in the how-to guide
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      doVignette: true
      vignetteValue: 0.0
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      astrometry.matcher.maxOffsetPix: 300
      python: |
        flags = ['base_PixelFlags_flag_edge', 'base_PixelFlags_flag_saturated', 'base_PsfFlux_flags']
        config.astrometry.sourceSelector['references'].flags.bad = flags

Parameters

As you saw in the previous section, each task defined in a Pipeline may have its own configuration. However, it is sometimes useful for configuration fields in multiple tasks to share the same value. Pipelines support this with a concept called parameters. This is a top level section in the Pipeline document specified with a key named parameters.

The parameters section is a mapping of key-value pairs. The keys can then be used throughout the document in the key-value section of config blocks in place of a concrete value.

To make this a bit clearer take a look at the following example, making note that only config fields relevant for this example are shown.

parameters:
  calibratedSingleFrame: calexp
tasks:
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      connections.outputExposure: parameters.calibratedSingleFrame
  makeWarp:
    class: lsst.pipe.tasks.makeCoaddTempExp.MakeWarpTask
    config:
      connections.calExpList: parameters.calibratedSingleFrame
  forcedPhotCcd:
    class: lsst.meas.base.forcedPhotCcd.ForcedPhotCcdTask
    config:
      connections.exposure: parameters.calibratedSingleFrame

The above example used parameters to link the dataset type names of multiple tasks, but parameters can be used anywhere more than one config field shares the same value; they are not restricted to dataset types.
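
As a sketch of a non-dataset-type use (the task classes and config field names here are hypothetical, chosen only to illustrate the syntax), one numeric parameter can feed config fields in two different tasks:

parameters:
  seeingFwhm: 1.2
tasks:
  modelPsf:
    class: mypackage.tasks.ModelPsfTask      # hypothetical task
    config:
      targetFwhm: parameters.seeingFwhm      # hypothetical config field
  selectGoodSeeing:
    class: mypackage.tasks.SelectImagesTask  # hypothetical task
    config:
      maxFwhm: parameters.seeingFwhm         # hypothetical config field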

Command line options for running Pipelines introduces how to run Pipelines and will talk about how to dynamically set a parameter's value at Pipeline invocation time.

Verifying Configuration: Contracts

The Config classes associated with Tasks provide a method named validate which can be used to check that all supplied configuration is valid. These validate methods, however, are shared by every instance of the config class, which means they cannot be specialized for the context in which the task is being used.

When writing Pipelines it is sometimes important to verify that configuration values are set in such a way as to ensure expected behavior, and/or are set consistently between one or more tasks. Pipelines support this sort of verification with a concept called contracts. These contracts are useful for ensuring that two separate config fields are set to the same value, or that a config parameter is set to a required value in the context of this pipeline. Because configuration values can be set anywhere from the Pipeline definition to the command-line invocation of the pipeline, these contracts ensure that required configuration is appropriate prior to execution.

contracts are expressions written with Python syntax that should evaluate to a boolean value. If any contract evaluates to false, the Pipeline configuration is deemed to be inconsistent, an error is raised, and execution of the Pipeline is halted.

Defining contracts involves adding a new top level key to your document named contracts. The value associated with this key is a yaml list of individual contracts. Each list member may either be the contract expression or a mapping that contains the expression and a message to include with an exception if the contract is violated. If the contract is defined as a mapping, the expression is associated with a key named contract and the message is a simple string associated with a key named msg.

The expressions in the contracts section reference configuration parameters of one or more tasks, identified by the label assigned in the tasks section. The syntax is similar to that of a task config override file, with the config variable replaced by the label of the task being configured. An example contract to go along with the above pipeline would be as follows:

contracts:
  - characterizeImage.applyApCorr == calibrate.applyApCorr

This same contract can be defined in a mapping with an associated message as below:

contracts:
  - contract: "characterizeImage.applyApCorr ==\
               calibrate.applyApCorr"
    msg: "The aperture correction sub tasks are not consistent"

It is important to note how contracts relate to parameters. While a parameter can be used to set two configuration variables to the same value at the time the Pipeline definition is read, it does not offer any validation; it is possible for someone to change the configuration of one of the fields before a Pipeline is run. Because of this, contracts should always be written without regard to how parameters are used.
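
For instance, building on the parameters example above, a contract can assert directly that the two connection fields still agree at execution time, regardless of whether they were set through the parameter or overridden later (a sketch using the labels and connection names from that example):

contracts:
  - calibrate.connections.outputExposure == makeWarp.connections.calExpList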

Subsets

Pipelines are the definition of a processing workflow from some input data products to some output data products. Frequently, however, there are sub-units within a Pipeline that define a useful portion of the Pipeline to run on their own. This may be something like processing single frames only.

You, as the author of the Pipeline, can define one or more of these processing units by creating a section in your Pipeline named subsets. The value associated with the subsets key is a new mapping. The keys of this mapping will be the labels used to refer to an individual subset. The values of this mapping can either be a yaml list of the task labels to be associated with this subset, or another yaml mapping. If it is the latter, the keys must be subset, which is associated with the yaml list of task labels, and description, which is associated with a descriptive message of what the subset is meant to do. Take a look at the following two examples, which show the same subset defined in each style.

subsets:
  processCcd:
    - isr
    - characterizeImage
    - calibrate

subsets:
  processCcd:
    subset:
      - isr
      - characterizeImage
      - calibrate
    description: A set of tasks to run when doing single frame processing

Once a subset is created the label associated with it can be used in any context where task labels are accepted. Examples of this will be shown in Command line options for running Pipelines.

Steps

Subsets are designed to be an encapsulation of a collection of tasks that are useful to run together. These runs could be for reasons as small as producing a quick set of QA debugging plots or as large as dividing the complete pipeline into stages for survey level production.

Because divisions of the pipeline have a special role in survey production, pipelines provide a dedicated place to highlight them: the steps key. This key is a list of the labels of all subsets that are considered steps in end-to-end processing. Alongside each label, a step must declare the set of dimensions the step is expected to run over. An example of the step syntax can be seen below. These bits of information allow campaign management / batch production software to better reason about how to handle the processing workflow found within a pipeline.

steps:
  # the label corresponding to a declared subset, and the dimensions
  # the processing of that subset is expected to take
  - label: step1
    sharding_dimensions: visit, detector
  - label: step2
    sharding_dimensions: tract, patch, skymap

Importing

Similar to subsets, which allow defining useful units within a Pipeline, it’s sometimes useful to construct a Pipeline out of other Pipelines. This is known as importing a Pipeline.

Importing other pipelines begins with a top level key named imports. The value associated with this key is a yaml list. The values of this list may be strings corresponding to a filesystem path of the Pipeline to import. These paths may contain environment variables to help in writing paths in a platform agnostic way.

Alternatively, the elements of the imports list may be yaml mappings. Such a mapping begins with a key named location whose value is the same kind of path described above. The mapping can optionally contain the keys include, exclude, and importContracts. The keys include and exclude can be used to specify which labels (or labeled subsets) to include or exclude, respectively, when importing a Pipeline. The values associated with these keys are specified as yaml lists, and the two keys are mutually exclusive: only one can be specified at a time. The importContracts key is optional and is associated with a boolean value that controls whether contracts from the imported pipeline should be included when importing, with a default value of true.
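
As an illustrative sketch (the file paths and environment variables are hypothetical), an imports section using both the plain-path form and the mapping form might look like:

imports:
  # plain-path form
  - $MY_PACKAGE_DIR/pipelines/_Ingredients.yaml
  # mapping form with an excluded label and contracts not imported
  - location: $OTHER_PACKAGE_DIR/pipelines/SingleFrame.yaml
    exclude:
      - isr
    importContracts: false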

A few further notes about including and excluding. When specifying labels with include or exclude, it is possible to use a labeled subset in place of a label; this has the same effect as typing out all of the labels listed in the subset. Another important point is the behavior of labels that are not imported (either because they are excluded, or because they are not part of the include list). If any omitted label appears as part of a subset, then that subset definition is not imported by default. This can be changed by adding the labeledSubsetModifyMode key to the import definition and setting its value to EDIT, which causes the subset to still be imported, with the missing task labels (due to include / exclude specifications) removed from it. A warning about this EDIT behavior: it can leave subsets in states that are no longer logically sound, as some required task could be missing. Because of this, the mode should only be used when one has a good understanding of the Pipelines in question and the consequences of the decision.
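
As a sketch of the EDIT mode (the path is again hypothetical), an import that excludes a label while still keeping any subsets that referenced it might look like:

imports:
  - location: $MY_PACKAGE_DIR/pipelines/_Ingredients.yaml
    exclude:
      - isr
    # subsets that contained isr are still imported, with the isr label removed
    labeledSubsetModifyMode: EDIT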

The order that Pipelines are listed in the imports section is not important. Another thing to note is that declared labels must be unique amongst all imported Pipelines.

Once one or more pipelines have been imported, the tasks section is processed. Any new labels (and thus PipelineTasks) that are declared simply extend the total Pipeline.

If a label declared in the tasks section was also declared in one of the imported Pipelines, one of two things happens. If the label is associated with the same PipelineTask that was declared in the imported pipeline, the definition will be extended. This means that any configs declared in the imported Pipeline will be merged with configs declared in the current Pipeline, with the current declaration taking config precedence. This behavior allows tasks to be extended in the current Pipeline.

If the label declared in the current Pipeline is associated with a different PipelineTask than the label in the imported declaration, then the label will be considered re-declared and the declaration in the current Pipeline will be used. The declaration defined in the imported Pipeline is dropped.
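
As a sketch (the imported path is hypothetical; the calibrate declaration mirrors the earlier examples), a Pipeline that imports another and extends the configuration of its calibrate label could look like:

description: Extends an imported single frame pipeline
imports:
  - $MY_PACKAGE_DIR/pipelines/_SingleFrame.yaml
tasks:
  calibrate:
    # same class as in the imported pipeline, so the configs are merged,
    # with this declaration taking precedence
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      astrometry.matcher.maxOffsetPix: 250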

obs_* package overrides for Pipelines

Pipelines support automatically loading Task configuration files defined in obs packages. A top level key named instrument is associated with a string representing the fully qualified name of the python instrument class. For instance, for an obs_subaru Pipeline this would look like:

instrument: lsst.obs.subaru.HyperSuprimeCam

The instrument key is available to all Pipelines, but by convention obs_* packages typically contain Pipelines that are customized for the instrument they represent, including relevant configs, PipelineTask (re)declarations, the instrument key, and so on. These pipelines can be found inside a directory named pipelines that lives at the root of each obs_* package.

These Pipelines enable you to run a Pipeline that is configured for the desired camera, or can serve as a base for further Pipelines to import.

Command line options for running Pipelines

This section is not intended to serve as a tutorial for processing data from the command line, for that refer to lsst.ctrl.mpexec or lsst.ctrl.bps. However, both of these tools accept URI pointers to a Pipeline. These URIs can be altered with a specific syntax which will control how the Pipeline is loaded.

The simplest form of a Pipeline specification is the URI at which the Pipeline can be found. This URI may be of any form supported by lsst.resources.ResourcePath. In the case that the pipeline resides in a file located on a filesystem accessible by the machine that will be processing the Pipeline (i.e. a file URI), there is no need to preface the URI with file://; a bare file path is assumed to be a file-based URI.

File based URIs also support shell variable expansion. If, for instance, the URI contains $ENV_VAR, the variable will be expanded prior to evaluating the path. A file based URI to a pipeline in an lsst package directory would look something like:

$PIPE_TASKS_DIR/pipelines/DRP.yaml

As an example of an alternative URI, here is one based on s3 storage:

s3://some_bucket/pipelines/DRP.yaml

For any type of URI, Pipelines may be specified with additional parameters specified after a # symbol. The most basic parameter is simply a label. Loading a Pipeline with this label specified will cause only this label to be loaded. It will be as if the Pipeline only contained that label. This is useful when you want to run only one PipelineTask out of an existing pipeline. Other fields such as contracts (that only contain a reference to the supplied label) and instrument will also be loaded. Using the example above, a URI of this form would look something like:

$PIPE_TASKS_DIR/pipelines/DRP.yaml#characterizeImage

Akin to loading a single label, multiple labels may be specified by separating each one with a comma like in the following example.

$PIPE_TASKS_DIR/pipelines/DRP.yaml#isr,characterizeImage,calibrate

Note that when supplying labels in this way, it is possible to supply a list of labels whose tasks do not define a complete processing flow. While there is nothing wrong with the Pipeline itself, a needed data product may be missing because the task that produces it is not in the list of labels to run. When this happens no processing will be possible, and you should look at the input and output dataset types of the tasks corresponding to the labels specified.

As mentioned above, subsets are useful for specifying multiple labels at one time. As such, labeled subsets can be used in the parameter list to indicate that all labels associated with the subset should be run. Also like labels, multiple labeled subsets can be used by separating each with a comma. Labeled subsets are essentially synonymous with labels and can be used interchangeably, so it is also possible to mix task labels and labeled subsets when specifying a parameter list. A Pipeline URI that specifies a subset based on our previous example would look like the following:

$PIPE_TASKS_DIR/pipelines/DRP.yaml#processCcd

Pipeline conventions

Below is a list of conventions that are commonly used when writing Pipelines. These are not hard requirements, but their use helps maintain consistency throughout the software stack.

  • The name of a Pipeline file should follow class naming conventions (camel case with first letter capital).

  • Preface a Pipeline filename with an underscore if it is not intended to be run directly; such a Pipeline is referred to as a private pipeline (it exists only to be imported as part of a larger pipeline).

  • Use importing to avoid really long documents, splitting them into ‘private’ Pipelines named as above.

  • Pipelines should contain a useful description of what the Pipeline is intended to do.

  • Pipelines should be placed in a directory called pipelines at the top level of a package.

  • Instrument packages should provide Pipelines that override standard Pipelines and are specifically configured for that instrument (if applicable).