.. _pipe_base_creating_pipeline:

###################
Creating a Pipeline
###################

.. note::

   This guide assumes some knowledge about `PipelineTask`\ s. If you would
   like, check out :doc:`Creating a PipelineTask ` for information on what a
   `PipelineTask` is and how to make one. Otherwise, this guide attempts to be
   mostly stand-alone and should be readable with minimal references.

`PipelineTask`\ s are bits of algorithmic code that define what data they need
as input, what they will produce as an output, and a ``run`` method which
produces this output. `Pipeline`\ s are high-level documents that create a
specification used to run one or more `PipelineTask`\ s. This how-to guide
will introduce you to the basic syntax of a `Pipeline` document and
progressively take you through: configuring tasks, verifying configuration,
specifying subsets of tasks, creating `Pipeline`\ s using composition, a basic
introduction to options when running `Pipeline`\ s, and common conventions
when creating `Pipeline`\ s.

----------------
A Basic Pipeline
----------------

`Pipeline` documents are written using yaml syntax. If you are unfamiliar with
yaml, there are many guides across the internet, but the basic idea is that it
is a simple markup language for describing key-value mappings and lists of
values (which may themselves be further mappings).

`Pipeline`\ s have two required keys, ``description`` and ``tasks``. The value
associated with ``description`` should give a reader an understanding of what
the `Pipeline` is intended to do and is written as plain text.

The second required key is ``tasks``. Unlike ``description``, which has plain
text as a value, the value associated with the ``tasks`` key is another
key-value mapping. This section defines what work this pipeline will do. The
keys of the inner mapping are labels used to refer to individual tasks. These
labels can be any name you choose; the only restriction is that they must be
unique amongst all the tasks. The values in this mapping can be a number of
things, which you will see through the course of this guide, but the most
basic is a string giving the fully qualified `PipelineTask` that is to be run.
This is a lot of text to digest, so take a look at the following example, as a
'picture' is worth a thousand words.

.. code-block:: yaml

   description: A demo pipeline in the how-to guide
   tasks:
     characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask

Writing this and saving it to a file with a .yaml extension is all it takes to
have a simple pipeline. The ``description`` reflects that this `Pipeline` is
intended just for this guide. The ``tasks`` section contains only one entry.
The label used for this entry, ``characterizeImage``, happens to match the
module of the task it points to. It could have been anything, but the name was
suitably descriptive, so it was a good choice. If run, this `Pipeline` would
execute `CharacterizeImageTask`, processing the datasets declared in that task
and writing the declared outputs.

Having a pipeline that runs a single `PipelineTask` does not seem very useful.
The examples below (and in subsequent sections) are a bit more realistic.
.. code-block:: yaml

   description: A demo pipeline in the how-to guide
   tasks:
     isr: lsst.ip.isr.IsrTask
     characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
     calibrate: lsst.pipe.tasks.calibrate.CalibrateTask

This `Pipeline` contains three tasks to run, all of which are steps in
processing a single frame. The order in which the tasks are executed is not
determined by the ordering of the tasks in the pipeline, but by the
definition/configuration of the `PipelineTask`\ s. It is the job of the
execution system to work out this order, so you may write the tasks in any
order in the `Pipeline`. The following `Pipeline` is exactly the same from an
execution point of view. That said, be kind to human readers and, if possible,
write the tasks in the order you expect them to execute most often, so readers
can gain some intuition.

.. code-block:: yaml

   description: A demo pipeline in the how-to guide
   tasks:
     characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
     calibrate: lsst.pipe.tasks.calibrate.CalibrateTask
     isr: lsst.ip.isr.IsrTask

Tasks define their inputs and outputs, which are used to construct an
execution graph of the specified tasks. A consequence of this is that if a
pipeline does not define all the tasks required to generate all needed inputs
and outputs, it will get caught before any execution occurs.

-----------------
Configuring Tasks
-----------------

`PipelineTask`\ s (and their subtasks) contain a multitude of configuration
options that alter the way the task executes. Because `Pipeline`\ s are
designed to do a specific type of processing (per the description field), some
tasks may need specific configurations set to enable/disable behavior in the
context of the specific `Pipeline`.

To configure a task associated with a particular label, the value associated
with the label must be changed from the qualified task name to a new
sub-mapping. This new sub-mapping should have two keys, ``class`` and
``config``. The ``class`` key should point to the same qualified task name as
before. The value associated with the ``config`` key is itself a mapping where
configuration overrides are declared. The example below shows this behavior in
action.

.. code-block:: yaml

   description: A demo pipeline in the how-to guide
   tasks:
     isr:
       class: lsst.ip.isr.IsrTask
       config:
         doVignette: true
         vignetteValue: 0.0
     characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
     calibrate:
       class: lsst.pipe.tasks.calibrate.CalibrateTask
       config:
         astrometry.matcher.maxOffsetPix: 300

This example shows the `Pipeline` from the previous section with configuration
overrides applied to two of the tasks. The label ``isr`` is now associated
with the keys ``class`` and ``config``. The class location is associated with
the ``class`` keyword instead of the label directly. The ``config`` keyword is
associated with the various `~lsst.pex.config.Field`\ s and the configuration
appropriate for this `Pipeline`, specified as an additional yaml mapping.

The complete complexity of `lsst.pex.config` can't be represented with simple
yaml mapping syntax. To account for this, ``config`` blocks in `Pipeline`\ s
support two special fields: ``file`` and ``python``. The ``file`` key may be
associated with either a single value pointing to a filesystem path where a
`lsst.pex.config` file can be found, or a yaml list of such paths. The file
paths can contain environment variables that will be expanded prior to loading
the file(s). These files will then be applied to the task at configuration
time to override any default values.
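As a minimal sketch of the ``file`` key, a ``config`` block might reference an
external `lsst.pex.config` override file like this. The ``$MY_PACKAGE_DIR``
environment variable and the ``isrOverrides.py`` file are hypothetical, shown
only to illustrate the syntax.

.. code-block:: yaml

   tasks:
     isr:
       class: lsst.ip.isr.IsrTask
       config:
         doVignette: true
         # hypothetical override file; environment variables are expanded
         # before loading, and a yaml list of paths is also accepted
         file: $MY_PACKAGE_DIR/config/isrOverrides.py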
Sometimes configuration is too complex to express with yaml syntax, yet it is
simple enough that it does not warrant its own config file. The ``python`` key
is designed to support this use case. The value associated with the key is a
(possibly multi-line) string with valid python syntax. This string is
evaluated and applied during task configuration exactly as if it had been
written in a file or typed out in an interpreter. The following example
expands the previous one to use the ``python`` key.

.. code-block:: yaml

   description: A demo pipeline in the how-to guide
   tasks:
     isr:
       class: lsst.ip.isr.IsrTask
       config:
         doVignette: true
         vignetteValue: 0.0
     characterizeImage: lsst.pipe.tasks.characterizeImage.CharacterizeImageTask
     calibrate:
       class: lsst.pipe.tasks.calibrate.CalibrateTask
       config:
         astrometry.matcher.maxOffsetPix: 300
         python: |
           flags = ['base_PixelFlags_flag_edge',
                    'base_PixelFlags_flag_saturated',
                    'base_PsfFlux_flags']
           config.calibrate.astrometry.sourceSelector['references'].flags.bad = flags

----------
Parameters
----------

As you saw in the previous section, each task defined in a `Pipeline` may have
its own configuration. However, it is sometimes useful for configuration
fields in multiple tasks to share the same value. `Pipeline`\ s support this
with a concept called ``parameters``. This is a top-level section in the
`Pipeline` document specified with a key named ``parameters``.

The ``parameters`` section is a mapping of key-value pairs. The keys can then
be used throughout the document in the key-value section of config blocks in
place of the concrete parameter value. To make this a bit clearer, take a look
at the following example, noting that only config fields relevant to this
example are shown.

.. code-block:: yaml

   parameters:
     calibratedSingleFrame: calexp
   tasks:
     calibrate:
       class: lsst.pipe.tasks.calibrate.CalibrateTask
       config:
         connections.outputExposure: parameters.calibratedSingleFrame
     makeWarp:
       class: lsst.pipe.tasks.makeCoaddTempExp.MakeWarpTask
       config:
         connections.calExpList: parameters.calibratedSingleFrame
     forcedPhotCcd:
       class: lsst.meas.base.forcedPhotCcd.ForcedPhotCcdTask
       config:
         connections.exposure: parameters.calibratedSingleFrame

The above example uses ``parameters`` to link the dataset type names of
multiple tasks, but ``parameters`` can be used anywhere that more than one
config field shares the same value; it is not restricted to dataset types.
:ref:`pipeline-running-intro` introduces how to run `Pipeline`\ s and
discusses how to dynamically set a ``parameters`` value at `Pipeline`
invocation time.

----------------------------------
Verifying Configuration: Contracts
----------------------------------

The `~lsst.pipe.base.config.Config` classes associated with
`~lsst.pipe.base.task.Task`\ s provide a method named ``validate`` which can
be used to verify that all supplied configuration is valid. These validate
methods, however, are shared by every instance of the config class. This means
they cannot be specialized for the context in which the task is being used.
When writing `Pipeline`\ s it is sometimes important to verify that
configuration values are either set in such a way as to ensure expected
behavior, and/or set consistently between one or more tasks. `Pipeline`\ s
support this sort of verification with a concept called ``contracts``.
These ``contracts`` are useful for ensuring that two separate config fields
are set to the same value, or that a config field is set to a required value
in the context of this pipeline. Because configuration values can be set
anywhere from the `Pipeline` definition to the command-line invocation of the
pipeline, ``contracts`` ensure that the required configuration is in place
prior to execution.

``contracts`` are expressions written with Python syntax that should evaluate
to a boolean value. If any ``contract`` evaluates to false, the `Pipeline`
configuration is deemed to be inconsistent, an error is raised, and execution
of the `Pipeline` is halted.

Defining contracts involves adding a new top-level key to your document named
``contracts``. The value associated with this key is a yaml list of individual
contracts. Each list member may either be the ``contract`` expression or a
mapping that contains the expression and a message to include with the
exception raised if the contract is violated. If the contract is defined as a
mapping, the expression is associated with a key named ``contract`` and the
message is a simple string associated with a key named ``msg``.

The expressions in the ``contracts`` section reference configuration fields
for one or more tasks identified by the label assigned in the ``tasks``
section. The syntax is similar to that of a task config override file, where
the ``config`` variable is replaced with the task label associated with the
task to configure. An example contract to go along with our above pipeline
would be as follows:

.. code-block:: yaml

   contracts:
     - characterizeImage.applyApCorr == calibrate.applyApCorr

This same contract can be defined in a mapping with an associated message as
below:

.. code-block:: yaml

   contracts:
     - contract: "characterizeImage.applyApCorr == calibrate.applyApCorr"
       msg: "The aperture correction sub tasks are not consistent"

It is important to note how ``contracts`` relate to ``parameters``. While a
``parameter`` can be used to set two configuration fields to the same value
when the `Pipeline` definition is read, it does not offer any validation. It
is possible for someone to change the configuration of one of the fields
before the `Pipeline` is run. Because of this, ``contracts`` should always be
written without regard to how ``parameters`` are used.

-------
Subsets
-------

`Pipeline`\ s are the definition of a processing workflow from some input data
products to some output data products. Frequently, however, there are
sub-units within a `Pipeline` that define a useful unit of the `Pipeline` to
run on their own. This may be something like processing single frames only.
You, as the author of the `Pipeline`, can define one or more of these
processing units by creating a section in your `Pipeline` named ``subsets``.

The value associated with the ``subsets`` key is a new mapping. The keys of
this mapping are the labels used to refer to individual ``subsets``. The
values of this mapping can either be a yaml list of the task labels to be
associated with the subset, or another yaml mapping. If it is the latter, the
keys must be ``subset``, which is associated with the yaml list of task
labels, and ``description``, which is associated with a description of what
the subset is meant to do. Take a look at the following two examples, which
show the same ``subset`` defined in both styles.

.. code-block:: yaml

   subsets:
     processCcd:
       - isr
       - characterizeImage
       - calibrate
.. code-block:: yaml

   subsets:
     processCcd:
       subset:
         - isr
         - characterizeImage
         - calibrate
       description: A set of tasks to run when doing single frame processing

Once a ``subset`` is created, the label associated with it can be used in any
context where task labels are accepted. Examples of this are shown in
:ref:`pipeline-running-intro`.

---------
Importing
---------

Similar to ``subsets``, which allow defining useful units within a `Pipeline`,
it is sometimes useful to construct a `Pipeline` out of other `Pipeline`\ s.
This is known as importing a `Pipeline`.

Importing other pipelines begins with a top-level key named ``imports``. The
value associated with this key is a yaml list. The elements of this list may
be strings corresponding to a filesystem path of the `Pipeline` to import.
These paths may contain environment variables to help in writing paths in a
platform-agnostic way. Alternatively, an element of the imports list may be a
yaml mapping. This mapping begins with a key named ``location`` whose value is
the same kind of path described above. The mapping can optionally contain the
keys ``include``, ``exclude``, and ``importContracts``. The keys ``include``
and ``exclude`` can be used to specify which labels (or labeled subsets) to
include or exclude, respectively, when importing a `Pipeline`. The values
associated with these keys are specified as yaml lists, and the two keys are
mutually exclusive; only one can be specified at a time. The
``importContracts`` key is optional and is associated with a boolean that
controls whether ``contracts`` from the imported pipeline should be included
when importing, with a default value of true.

A few further notes about including and excluding. When specifying labels with
``include`` or ``exclude``, it is possible to use a labeled subset in place of
a label. This has the same effect as typing out all of the labels listed in
the subset. Another important point is the behavior of labels that are not
imported (either because they are excluded, or because they are not part of
the include list). If any omitted label appears as part of a subset, then that
subset definition is not imported. The order in which `Pipeline`\ s are listed
in the ``imports`` section is not important. Also note that declared labels
must be unique amongst all imported `Pipeline`\ s.

Once one or more pipelines is imported, the ``tasks`` section is processed. If
any new labels (and thus `PipelineTask`\ s) are declared, they simply extend
the total `Pipeline`. If a label declared in the ``tasks`` section was already
declared in one of the imported `Pipeline`\ s, one of two things happens. If
the label is associated with the same `PipelineTask` that was declared in the
imported pipeline, the definition is extended. This means that any configs
declared in the imported `Pipeline` are merged with configs declared in the
current `Pipeline`, with the current declaration taking config precedence.
This behavior allows tasks to be extended in the current `Pipeline`. If the
label declared in the current `Pipeline` is associated with a different
`PipelineTask` than the label in the imported declaration, then the label is
considered re-declared and the declaration in the current `Pipeline` is used.
The declaration defined in the imported `Pipeline` is dropped.
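To make the structure concrete, here is a minimal sketch of an ``imports``
section. It assumes, purely for illustration, that the imported
``$PIPE_TASKS_DIR/pipelines/DRP.yaml`` defines a labeled subset named
``processCcd`` and a ``calibrate`` label; the actual contents of that pipeline
are not guaranteed here.

.. code-block:: yaml

   description: A demo pipeline assembled by importing another pipeline
   imports:
     - location: $PIPE_TASKS_DIR/pipelines/DRP.yaml
       # import only the labels contained in the processCcd subset;
       # contracts from the imported pipeline are kept, since
       # importContracts defaults to true
       include:
         - processCcd
   tasks:
     # extend the imported calibrate declaration with an extra config override
     calibrate:
       class: lsst.pipe.tasks.calibrate.CalibrateTask
       config:
         astrometry.matcher.maxOffsetPix: 300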
--------------------------------------
obs\_* package overrides for Pipelines
--------------------------------------

`Pipeline`\ s support automatically loading `~lsst.pipe.base.Task`
configuration files defined in obs packages. A top-level key named
``instrument`` is associated with a string representing the fully qualified
class name of the python camera object. For instance, for an ``obs_subaru``
`Pipeline` this would look like:

.. code-block:: yaml

   instrument: lsst.obs.subaru.HyperSuprimeCam

The ``instrument`` key is available to all `Pipeline`\ s, but by convention
obs\_* packages typically contain `Pipeline`\ s that are customized for the
instrument they represent, including relevant configs, `PipelineTask`
(re)declarations, the instrument declaration, etc. These pipelines can be
found inside a directory named ``pipelines`` that lives at the root of each
obs\_* package. They enable you to run a `Pipeline` that is configured for the
desired camera, or can serve as a base for further `Pipeline`\ s to import.

.. _pipeline-running-intro:

------------------------------------------
Command line options for running Pipelines
------------------------------------------

This section is not intended to serve as a tutorial for processing data from
the command line; for that, refer to `lsst.ctrl.mpexec` or `lsst.ctrl.bps`.
However, both of these tools accept URI pointers to a `Pipeline`. These URIs
can be altered with a specific syntax which controls how the `Pipeline` is
loaded.

The simplest form of a `Pipeline` specification is the URI at which the
`Pipeline` can be found. This URI may be any supported by
`lsst.daf.butler.ButlerURI`. In the case that the pipeline resides in a file
located on a filesystem accessible by the machine that will be processing the
`Pipeline` (i.e. a file URI), there is no need to preface the URI with
``file://``; a bare file path is assumed to be a file-based URI. File-based
URIs also support shell variable expansion. If, for instance, the URI contains
``$ENV_VAR``, the variable will be expanded prior to evaluating the path. A
file-based URI to a pipeline in an lsst package directory would look something
like:

``$PIPE_TASKS_DIR/pipelines/DRP.yaml``

As an example of an alternative URI, here is one based on s3 storage:

``s3://some_bucket/pipelines/DRP.yaml``

For any type of URI, `Pipeline`\ s may be loaded with additional parameters
specified after a ``#`` symbol. The most basic parameter is simply a label.
Loading a `Pipeline` with this label specified will cause only this label to
be loaded; it will be as if the `Pipeline` only contained that label. This is
useful when you want to run only one `PipelineTask` out of an existing
pipeline. Other fields, such as contracts (that only contain a reference to
the supplied label) and instrument, will also be loaded. Using the example
above, a URI of this form would look something like:

``$PIPE_TASKS_DIR/pipelines/DRP.yaml#characterizeImage``

Akin to loading a single label, multiple labels may be specified by separating
each one with a comma, as in the following example:

``$PIPE_TASKS_DIR/pipelines/DRP.yaml#isr,characterizeImage,calibrate``

Note that when supplying labels in this way, it is possible to supply a list
of labels whose tasks do not define a complete processing flow. While there is
nothing wrong with the `Pipeline` itself, there may be missing data products
because the task that produces them is not in the list of labels to run.
When this happens, no processing will be able to be done, and you should look
at the input and output dataset types for each of the tasks corresponding to
the labels specified.

As mentioned above, subsets are useful for specifying multiple labels at one
time, so labeled subsets can be used in the parameter list to indicate that
all labels associated with the subset should be run. Like labels, multiple
labeled subsets can be used by separating each with a comma. Labeled subsets
are essentially synonymous with labels and can be used interchangeably, so it
is also possible to mix task labels and labeled subsets when specifying a
parameter list. A `Pipeline` URI that specifies a subset, based on our
previous example, would look like the following:

``$PIPE_TASKS_DIR/pipelines/DRP.yaml#processCcd``

--------------------
Pipeline conventions
--------------------

Below is a list of conventions that are commonly used when writing
`Pipeline`\ s. These are not hard requirements, but their use helps maintain
consistency throughout the software stack. A short sketch illustrating several
of these conventions follows the list.

* The name of a `Pipeline` file should follow class naming conventions (camel
  case with the first letter capitalized).

* Preface a `Pipeline` name with an underscore if it is not intended to be
  imported or run directly on its own; such a `Pipeline` is referred to as a
  private pipeline (it is part of a larger pipeline).

* Use importing to avoid really long documents, building them from 'private'
  `Pipeline`\ s named as above.

* `Pipeline`\ s should contain a useful description of what the `Pipeline` is
  intended to do.

* `Pipeline`\ s should be placed in a directory called ``pipelines`` at the
  top level of a package.

* Instrument packages should provide `Pipeline`\ s that override standard
  `Pipeline`\ s and are specifically configured for that instrument (if
  applicable).
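As a purely illustrative sketch of these conventions, consider a hypothetical
package with an environment variable ``$MYPACKAGE_DIR`` that ships a public
``pipelines/DRP.yaml`` built from a private ``pipelines/_SingleFrame.yaml``.
The package, the file names, and the directory layout are assumptions made for
this example only.

.. code-block:: yaml

   # Hypothetical file: $MYPACKAGE_DIR/pipelines/DRP.yaml
   description: End-to-end demo processing for a hypothetical package
   imports:
     # _SingleFrame.yaml is a 'private' pipeline in the same pipelines
     # directory; the leading underscore signals that it is not meant to
     # be run directly on its own
     - $MYPACKAGE_DIR/pipelines/_SingleFrame.yaml
   tasks:
     # additional processing that extends the imported private pipeline
     forcedPhotCcd: lsst.meas.base.forcedPhotCcd.ForcedPhotCcdTask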