Creating a Pipeline
Note
This guide assumes some knowledge about PipelineTasks; if needed, see Creating a PipelineTask for information on what a PipelineTask is and how to make one. Otherwise, this guide attempts to be mostly stand-alone, and should be readable with minimal references.
PipelineTasks are bits of algorithmic code that define what data they need as input, what they produce as output, and a run method which produces this output. Pipelines are high-level documents that specify how to run one or more PipelineTasks.
This how-to guide will introduce you to the basic syntax of a Pipeline document, and progressively take you through: configuring tasks, verifying configuration, specifying subsets of tasks, creating Pipelines using composition, a basic introduction to options when running Pipelines, and common conventions when creating Pipelines.
A Basic Pipeline
Pipeline documents are written using yaml syntax. If you are unfamiliar with yaml, there are many guides across the internet, but the basic idea is that it is a simple markup language to describe key-value mappings and lists of values (which may be further mappings).
Pipelines have two required keys, description and tasks. The value associated with the description should provide a reader an understanding of what the Pipeline is intended to do and is written as plain text.
The second required key is tasks. Unlike description, which has plain text as a value, the ‘value’ associated with the tasks key is another key-value mapping. This section defines what work this pipeline will do. The keys of the inner mapping are labels which will be used to refer to an individual task. These labels can be any name you choose; the only restriction is that they must be unique amongst all the tasks. The values in this mapping can be a number of things, which you will see through the course of this guide, but the most basic is a string giving the fully qualified PipelineTask that is to be run.
This is a lot of text to digest, so take a look at the following example; a ‘picture’ is worth a thousand words.
description: A demo pipeline in the how-to guide
tasks:
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
Writing this and saving it to a file with a .yaml extension is all it takes to have a simple pipeline. The description reflects that this Pipeline is intended just for this guide. The tasks section contains only one entry. The label used for this entry, characterizeImage, happens to match the module of the task it points to. It could have been anything, but the name was suitably descriptive, so it was a good choice.
If run, this Pipeline would execute CharacterizeImageTask, processing the datasets declared in that task and writing the declared outputs.
Having a pipeline that runs a single PipelineTask does not seem very useful. The examples below (and in subsequent sections) are a bit more realistic.
description: A demo pipeline in the how-to guide
tasks:
  isr: lsst.ip.isr.IsrTask
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate: lsst.pipe.tasks.calibrate.CalibrateTask
This Pipeline contains 3 tasks to run, all of which are steps in processing a single frame. The order in which the tasks are executed is determined not by the ordering of the tasks in the pipeline, but by the definition/configuration of the PipelineTasks. It is the job of the execution system to work out this order, so you may write the tasks in any order in the Pipeline. The following Pipeline is exactly the same from an execution point of view. With that said, be kind to human readers and, if possible, write the tasks in the order you expect them to execute most often, so readers can gain some intuition.
description: A demo pipeline in the how-to guide
tasks:
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate: lsst.pipe.tasks.calibrate.CalibrateTask
  isr: lsst.ip.isr.IsrTask
Tasks define their inputs and outputs, which are used to construct an execution graph of the specified tasks. A consequence of this is that if a pipeline does not define all the tasks required to generate the needed inputs and outputs, the problem will be caught before any execution occurs.
Configuring Tasks
PipelineTasks (and their subtasks) contain a multitude of configuration options that alter the way the task executes. Because Pipelines are designed to do a specific type of processing (per the description field), some tasks may need specific configurations set to enable/disable behavior in the context of the specific Pipeline.
To configure a task associated with a particular label, the value associated with the label must be changed from the qualified task name to a new sub-mapping. This new sub-mapping should have two keys, class and config.
The class key should point to the same qualified task name as before. The value associated with the config keyword is itself a mapping where configuration overrides are declared. The example below shows this behavior in action.
description: A demo pipeline in the how-to guide
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      doVignette: true
      vignetteValue: 0.0
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      astrometry.matcher.maxOffsetPix: 300
This example shows the Pipeline from the previous section with configuration overrides applied to two of the tasks. The label isr is now associated with the keys class and config. The class location is associated with the class keyword instead of the label directly. The config keyword is associated with a further yaml mapping in which the relevant configuration Fields are set to the values appropriate for this Pipeline.
The complete complexity of lsst.pex.config can’t be represented with simple yaml mapping syntax. To account for this, config blocks in Pipelines support two special fields: file and python.
The file key may be associated with either a single value pointing to a filesystem path where an lsst.pex.config file can be found, or a yaml list of such paths. The file paths can contain environment variables that will be expanded prior to loading the file(s). These files will then be applied to the task at configuration time to override any default values.
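As a concrete sketch (the environment variable and config file name here are hypothetical placeholders, not real package contents), a config block loading overrides from an external file might look like:
description: A demo pipeline in the how-to guide
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      file: $MY_CONFIG_PACKAGE_DIR/config/isrOverrides.py
A yaml list of such paths may be supplied instead of a single string when several files need to be applied.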
Sometimes configuration is too complex to express with yaml syntax, yet it is simple enough that it does not warrant its own config file. The python key is designed to support this use case. The value associated with the key is a (possibly multi-line) string containing valid python syntax. This string is evaluated and applied during task configuration exactly as if it had been written in a file or typed out in an interpreter. The following example expands the previous one to use the python key.
description: A demo pipeline in the how-to guide
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      doVignette: true
      vignetteValue: 0.0
  characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      astrometry.matcher.maxOffsetPix: 300
      python: |
        flags = ['base_PixelFlags_flag_edge', 'base_PixelFlags_flag_saturated', 'base_PsfFlux_flags']
        config.calibrate.astrometry.sourceSelector['references'].flags.bad = flags
Parameters
As you saw in the previous section, each task defined in a Pipeline may have its own configuration. However, it is sometimes useful for configuration fields in multiple tasks to share the same value. Pipelines support this with a concept called parameters. This is a top level section in the Pipeline document specified with a key named parameters.
The parameters section is a mapping of key-value pairs. The keys can then be used throughout the document in the key-value section of config blocks in place of a concrete value.
To make this a bit clearer, take a look at the following example; note that only the config fields relevant to this example are shown.
parameters:
  calibratedSingleFrame: calexp
tasks:
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      connections.outputExposure: parameters.calibratedSingleFrame
  makeWarp:
    class: lsst.pipe.tasks.makeCoaddTempExp.MakeWarpTask
    config:
      connections.calExpList: parameters.calibratedSingleFrame
  forcedPhotCcd:
    class: lsst.meas.base.forcedPhotCcd.ForcedPhotCcdTask
    config:
      connections.exposure: parameters.calibratedSingleFrame
The above example used parameters to link the dataset type names for multiple tasks, but parameters can be used anywhere that more than one config field shares the same value; they are not restricted to dataset types. Command line options for running Pipelines introduces how to run Pipelines and will discuss how to dynamically set a parameter's value at Pipeline invocation time.
Verifying Configuration: Contracts
The Config classes associated with Tasks provide a validate method which can be used to verify that all supplied configuration is valid. These validation methods, however, are shared by every instance of the config class. This means they cannot be specialized for the context in which the task is being used.
When writing Pipelines it is sometimes important to verify that configuration values are either set in such a way as to ensure expected behavior, and/or set consistently between one or more tasks. Pipelines support this sort of verification with a concept called contracts. These contracts are useful for ensuring two separate config fields are set to the same value, or ensuring a config parameter is set to a required value in the context of this pipeline. Because configuration values can be set anywhere from the Pipeline definition to the command-line invocation of the pipeline, these contracts ensure that required configuration is appropriate prior to execution.
contracts are expressions written with Python syntax that should evaluate to a boolean value. If any contract evaluates to false, the Pipeline configuration is deemed to be inconsistent, an error is raised, and execution of the Pipeline is halted.
Defining contracts involves adding a new top level key to your document named contracts. The value associated with this key is a yaml list of individual contracts. Each list member may either be the contract expression or a mapping that contains the expression and a message to include with an exception if the contract is violated. If the contract is defined as a mapping, the expression is associated with a key named contract and the message is a simple string associated with a key named msg.
The expressions in the contracts section reference configuration parameters for one or more tasks identified by the assigned label in the tasks section. The syntax is similar to that of a task config override file, where the config variable is replaced with the label associated with the task to configure. An example contract to go along with the above pipeline would be as follows:
contracts:
  - characterizeImage.applyApCorr == calibrate.applyApCorr
This same contract can be defined in a mapping with an associated message as below:
contracts:
  - contract: "characterizeImage.applyApCorr == calibrate.applyApCorr"
    msg: "The aperture correction sub tasks are not consistent"
It is important to note how contracts relate to parameters. While a parameter can be used to set two configuration variables to the same value at the time the Pipeline definition is read, it does not offer any validation. It is possible for someone to change the configuration of one of the fields before a Pipeline is run. Because of this, contracts should always be written without regard to how parameters are used.
Subsets
Pipelines are the definition of a processing workflow from some input data products to some output data products. Frequently, however, there are sub-units within a Pipeline that define a useful unit of the Pipeline to run on their own. This may be something like processing single frames only. You, as the author of the Pipeline, can define one or more of these processing units by creating a section in your Pipeline named subsets.
The value associated with the subsets key is a new mapping. The keys of this mapping will be the labels used to refer to an individual subset. The values of this mapping can either be a yaml list of the task labels to be associated with this subset, or another yaml mapping. If it is the latter, the keys must be subset, which is associated with the yaml list of task labels, and description, which is associated with a descriptive message of what the subset is meant to do. Take a look at the following two examples, which show the same subset defined in both styles.
subsets:
  processCcd:
    - isr
    - characterizeImage
    - calibrate
subsets:
  processCcd:
    subset:
      - isr
      - characterizeImage
      - calibrate
    description: A set of tasks to run when doing single frame processing
Once a subset is created, the label associated with it can be used in any context where task labels are accepted. Examples of this will be shown in Command line options for running Pipelines.
Importing
Similar to subsets, which allow defining useful units within a Pipeline, it’s sometimes useful to construct a Pipeline out of other Pipelines. This is known as importing a Pipeline.
Importing other pipelines begins with a top level key named imports. The value associated with this key is a yaml list. The values of this list may be strings corresponding to a filesystem path of the Pipeline to import. These paths may contain environment variables to help in writing paths in a platform agnostic way.
Alternatively, the elements of the imports list may be yaml mappings. Such a mapping begins with a key named location whose value is the same kind of path described above. The mapping can optionally contain the keys include, exclude, and importContracts. The keys include and exclude can be used to specify which labels (or labeled subsets) to include or exclude, respectively, when importing a Pipeline. The values associated with these keys are specified as a yaml list, and these two keys are mutually exclusive: only one can be specified at a time. The importContracts key is optional and is associated with a boolean value that controls whether contracts from the imported pipeline should be included when importing, with a default value of true.
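Putting this together, an imports section that uses both the plain-path form and the mapping form might look like the following sketch; the second pipeline location and the excluded label are hypothetical and serve only to illustrate the syntax:
description: A demo pipeline built from other pipelines
imports:
  - $PIPE_TASKS_DIR/pipelines/DRP.yaml
  - location: $MY_PACKAGE_DIR/pipelines/ExtraProcessing.yaml
    exclude:
      - someLabelToSkip
    importContracts: false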
A few further notes about including and excluding. When specifying labels with include or exclude, it is possible to use a labeled subset in place of a label. This will have the same effect as typing out all of the labels listed in the subset. Another important point is the behavior of labels that are not imported (either because they are excluded, or because they are not part of the include list): if any omitted label appears as part of a subset, then that subset definition is not imported.
The order in which Pipelines are listed in the imports section is not important. Another thing to note is that declared labels must be unique amongst all imported Pipelines.
Once one or more pipelines is imported, the tasks section is processed. If any new labels (and thus PipelineTasks) are declared, they simply extend the total Pipeline.
If a label declared in the tasks section was also declared in one of the imported Pipelines, one of two things happens. If the label is associated with the same PipelineTask that was declared in the imported pipeline, the definition will be extended. This means that any configs declared in the imported Pipeline will be merged with configs declared in the current Pipeline, with the current declaration taking config precedence. This behavior allows tasks to be extended in the current Pipeline.
If the label declared in the current Pipeline is associated with a different PipelineTask than the label in the imported declaration, then the label will be considered re-declared and the declaration in the current Pipeline will be used. The declaration defined in the imported Pipeline is dropped.
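As a sketch of the first case, the following Pipeline imports the demo pipeline from earlier in this guide (assumed here to be saved as demoPipeline.yaml in a hypothetical package) and extends the isr label with an additional config override. Because the label is associated with the same PipelineTask, the two declarations are merged and the value set here takes precedence:
description: The demo pipeline with extra isr configuration
imports:
  - $MY_PACKAGE_DIR/pipelines/demoPipeline.yaml
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      doVignette: false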
obs_* package overrides for Pipelines
Pipelines support automatically loading Task configuration files defined in obs packages. A top level key named instrument is associated with a string representing the fully qualified class name of the python camera object. For instance, for an obs_subaru Pipeline this would look like:
instrument: lsst.obs.subaru.HyperSuprimeCam
The instrument key is available to all Pipelines, but by convention obs_* packages typically contain Pipelines that are customized for the instrument they represent. This includes relevant configs, PipelineTask (re)declarations, the instrument label, etc. These Pipelines can be found inside a directory named pipelines that lives at the root of each obs_ package.
These Pipelines enable you to run a Pipeline that is configured for the desired camera, or can serve as a base for further Pipelines to import.
Command line options for running Pipelines
This section is not intended to serve as a tutorial for processing data from the command line; for that, refer to lsst.ctrl.mpexec or lsst.ctrl.bps. However, both of these tools accept URI pointers to a Pipeline. These URIs can be altered with a specific syntax which controls how the Pipeline is loaded.
The simplest form of a Pipeline specification is the URI at which the Pipeline can be found. This URI may be any supported by lsst.daf.butler.ButlerURI. In the case that the pipeline resides in a file located on a filesystem accessible by the machine that will be processing the Pipeline (i.e. a file URI), there is no need to preface the URI with file://; a bare file path is assumed to be a file based URI.
File based URIs also support shell variable expansion. If, for instance, the URI contains $ENV_VAR, the variable will be expanded prior to evaluating the path. A file based URI to a pipeline in an lsst package directory would look something like:
$PIPE_TASKS_DIR/pipelines/DRP.yaml
As an example of an alternative URI, here is one based on s3 storage:
s3://some_bucket/pipelines/DRP.yaml
For any type of URI, Pipelines may be specified with additional parameters after a # symbol. The most basic parameter is simply a label. Loading a Pipeline with a label specified will cause only that label to be loaded; it will be as if the Pipeline only contained that label. This is useful when you want to run only one PipelineTask out of an existing pipeline. Other fields, such as contracts (that only contain a reference to the supplied label) and instrument, will also be loaded. Using the example above, a URI of this form would look something like:
$PIPE_TASKS_DIR/pipelines/DRP.yaml#characterizeImage
Akin to loading a single label, multiple labels may be specified by separating each one with a comma, as in the following example.
$PIPE_TASKS_DIR/pipelines/DRP.yaml#isr,characterizeImage,calibrate
Make note that, when supplying labels in this way, it is possible to supply a list of labels whose tasks do not define a complete processing flow. While there is nothing wrong with the Pipeline itself, there may be missing data products because the tasks that produce them are not in the list of labels to run. When this happens no processing will be able to be done, and you should look at the input and output dataset types for each of the tasks corresponding to the labels specified.
As mentioned above, subsets are useful for specifying multiple labels at one time. As such, labeled subsets can be used in the parameter list to indicate that all labels associated with the subset should be run. Also like labels, multiple labeled subsets can be used by separating each with a comma. Labeled subsets are essentially synonymous with labels and can be used interchangeably, so it is also possible to mix task labels and labeled subsets when specifying a parameter list. A Pipeline URI that specifies a subset based on our previous example would look like the following:
$PIPE_TASKS_DIR/pipelines/DRP.yaml#processCcd
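Mixing a labeled subset with individual task labels works the same way; assuming the pipeline also defines a label named makeWarp, such a URI could look like:
$PIPE_TASKS_DIR/pipelines/DRP.yaml#processCcd,makeWarp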
Pipeline conventions
Below is a list of conventions that are commonly used when writing Pipelines. These are not hard requirements, but their use helps maintain consistency throughout the software stack.
- The name of a Pipeline file should follow class naming conventions (camel case with the first letter capitalized).
- Preface a Pipeline name with an underscore if it is not intended to be inherited and/or run directly; such a Pipeline is referred to as a private pipeline (it is part of a larger pipeline).
- Use inheritance to avoid really long documents, using ‘private’ Pipelines named as above.
- Pipelines should contain a useful description of what the Pipeline is intended to do.
- Pipelines should be placed in a directory called pipelines at the top level of a package.
- Instrument packages should provide Pipelines that override standard Pipelines and are specifically configured for that instrument (if applicable).