Using Butler data repositories and reruns with command-line tasks

Command-line tasks use Butler data repositories for reading and writing data. This page describes two ways to specify data repositories on the command line:

  1. Specify input and output repository URIs (or file paths for POSIX-backed Butler data repositories) with the REPOPATH and --output command-line arguments. Read about this in Using input and output repositories.

  2. Use reruns, which are output data repositories relative to a single root repository, with the REPOPATH and --rerun arguments. Read about this pattern in Using reruns to organize outputs in a single data repository.

About repository paths and URIs

Butler data repositories can be hosted on a variety of backends, from a local POSIX directory to a more specialized data store like Swift. All Butler backends are functionally equivalent from a command-line user’s perspective. In each case, you specify a data repository with its URI.

For a POSIX filesystem backend, you can specify a path to the repository directory through:

  • A relative path.

  • An absolute path.

  • A URI prefixed with file://.

Other backends always require a URI. For example, a URI like swift://host/path points to a Butler repository backed by the Swift object store.

In the how-to topics below, URIs or POSIX paths can be used as needed for the inputrepo (REPOPATH) and outputrepo (--output) command-line arguments.
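The distinction between POSIX paths and URIs can be sketched in a few lines of Python. This is only an illustration of the rules above, not how the Butler parses its arguments: classify_repo and its return labels are hypothetical names, and the sketch assumes that anything without a URI scheme is a POSIX path.

```python
from urllib.parse import urlparse


def classify_repo(spec: str) -> str:
    """Return a rough label for a repository specifier (hypothetical helper)."""
    scheme = urlparse(spec).scheme
    if scheme == "":
        # No scheme: a plain POSIX path, relative or absolute.
        return "posix-path"
    if scheme == "file":
        # file:// URI: still the POSIX filesystem backend.
        return "posix-uri"
    # Any other scheme (for example swift://) names a non-POSIX backend.
    return f"{scheme}-uri"


for spec in ("inputrepo", "/data/inputrepo", "file:///data/inputrepo", "swift://host/path"):
    print(spec, "->", classify_repo(spec))
```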

Using input and output repositories

How to create a new output repository

To have a command-line task read data from an inputrepo repository and write to an outputrepo repository, set the REPOPATH and --output arguments like this:

task.py inputrepo --output outputrepo ...

The outputrepo directory will be created if it does not already exist.

How to chain output repositories

The output repository for one task can become the input repository for the next command-line task. For example:

task2.py outputrepo --output outputrepo2 ...

Because Butler data repositories are chained, the output repository (here, outputrepo2) provides access to all the datasets in the chain (here: inputrepo, outputrepo, and outputrepo2 itself).
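As a mental model, chaining can be pictured as each output repository remembering the input repository it was created from. The following toy Python sketch mimics the two commands above. It does not use the Butler API; parents, run_task, and visible_repos are hypothetical names invented for the illustration.

```python
# Toy model: each output repository remembers the input repository it was created from.
parents = {}


def run_task(input_repo, output_repo):
    """Record that output_repo is chained to input_repo (stands in for a task run)."""
    if input_repo != output_repo:
        parents.setdefault(output_repo, input_repo)


def visible_repos(repo):
    """List the repositories whose datasets are reachable from repo, nearest first."""
    chain = [repo]
    # Follow parent links, stopping at the root or if a link would revisit a repository.
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain


run_task("inputrepo", "outputrepo")     # task.py inputrepo --output outputrepo
run_task("outputrepo", "outputrepo2")   # task2.py outputrepo --output outputrepo2
print(visible_repos("outputrepo2"))     # ['outputrepo2', 'outputrepo', 'inputrepo']
```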

How to re-use output repositories

An output repository can be the same as the input repository:

task3.py outputrepo2 --output outputrepo2 ...

This pattern is useful for reducing the number of repositories. The trade-off is that packing outputs from multiple tasks into one output repository reduces your flexibility to run a task several times with different configurations and compare the outputs.

You can also run the same task multiple times with the same output repository. Be aware that the Science Pipelines check task configurations to help you maintain the integrity of the processed data’s provenance. If you change a task’s configuration and re-run the task into the same output repository, the task reports the error “Config does not match existing task config.” See Working with provenance checks in command-line tasks.
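The idea behind this check can be illustrated with a toy Python model: the first run into a repository records the task's configuration, and later runs must match it. This is a sketch of the concept only, not the Science Pipelines implementation. ConfigMismatchError, saved_configs, check_config, and the threshold setting are all hypothetical.

```python
class ConfigMismatchError(RuntimeError):
    """Hypothetical error type for this sketch."""


# Toy store: (repository, task name) -> configuration recorded by the first run.
saved_configs = {}


def check_config(repo, task_name, config):
    """Record the config on first use; refuse a different config afterwards."""
    key = (repo, task_name)
    previous = saved_configs.setdefault(key, dict(config))
    if previous != config:
        raise ConfigMismatchError("Config does not match existing task config")


check_config("outputrepo2", "task3", {"threshold": 5.0})  # first run: recorded
check_config("outputrepo2", "task3", {"threshold": 5.0})  # same config: accepted
try:
    check_config("outputrepo2", "task3", {"threshold": 3.0})  # changed config
except ConfigMismatchError as err:
    print("error:", err)
```

Writing the changed configuration's outputs to a different output repository avoids the mismatch, because each repository keeps its own record.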

How to use repository path environment variables

The PIPE_INPUT_ROOT and PIPE_OUTPUT_ROOT environment variables can help you specify data repository paths more succinctly. When set, the REPOPATH argument path is treated as relative to PIPE_INPUT_ROOT and the --output path is relative to PIPE_OUTPUT_ROOT.

These environment variables are optional. When they aren’t set in your shell, the REPOPATH and --output arguments alone specify the paths or URIs to Butler data repositories.

See Environment variables for details.
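The effect of these variables can be sketched as follows. This is a simplified model that assumes POSIX paths only and a plain path join. The helper resolve_repo is hypothetical, and the actual resolution rules (for example, for URIs or absolute paths) may differ.

```python
import os
import posixpath


def resolve_repo(arg, env_var, environ=os.environ):
    """Resolve a repository argument against an optional root environment variable."""
    root = environ.get(env_var)
    if not root:
        # Variable unset: the argument alone names the repository.
        return arg
    # Variable set: treat the argument as relative to the root.
    return posixpath.join(root, arg)


# Example with an explicit mapping standing in for the shell environment.
env = {"PIPE_INPUT_ROOT": "/datasets", "PIPE_OUTPUT_ROOT": "/scratch/me"}
print(resolve_repo("inputrepo", "PIPE_INPUT_ROOT", env))    # /datasets/inputrepo
print(resolve_repo("outputrepo", "PIPE_OUTPUT_ROOT", env))  # /scratch/me/outputrepo
print(resolve_repo("inputrepo", "PIPE_INPUT_ROOT", {}))     # inputrepo
```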

Using reruns to organize outputs in a single data repository

An alternative way to organize output data repositories is with reruns (--rerun command-line argument). Reruns are a convention for repositories that are located relative to a single root data repository. If the root repository’s URI is file://REPOPATH, a rerun called my_rerun automatically has a full URI of:

file://REPOPATH/rerun/my_rerun

In practice, you don’t need to know the full URIs of individual reruns. You only need to know the URI of the root repository and the names of individual reruns, which is what makes reruns convenient.
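The naming convention is simple enough to express as a one-line function. The helper name rerun_uri is hypothetical, and the sketch only encodes the root/rerun/name layout described above.

```python
def rerun_uri(root, name):
    """Return the URI of a rerun under a root repository, per the rerun convention."""
    return f"{root.rstrip('/')}/rerun/{name}"


print(rerun_uri("file://REPOPATH", "my_rerun"))  # file://REPOPATH/rerun/my_rerun
```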

How to create a rerun

To use input data from a Butler repository called DATA and write outputs to a rerun called A, set a command-line task’s REPOPATH and --rerun like this:

task1.py DATA --rerun A ...

Tip

Once you’ve created an output rerun with one command-line task, you can re-use it as the output repository for subsequent command-line task runs (see How to write outputs to an existing rerun).

Alternatively, you can chain new reruns together with each processing step (see the next section).

For perspective on when to create a new rerun, or reuse an existing one, see When to create a new rerun.

How to use one rerun as input to another (chaining)

To use data written to rerun A as inputs but have results written to a new rerun B, use the --rerun argument’s input:output syntax, like this:

task2.py DATA --rerun A:B ...

This syntax automatically chains rerun B to rerun A, just like Butler repository chaining in general (see How to chain output repositories). For example, if rerun B is later used as an input rerun, it will provide access to datasets in rerun B, rerun A, and the root repository DATA itself.
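To make the two forms of the --rerun value concrete, here is a small sketch that maps a root repository and a --rerun value to the input and output repository paths. It is an illustration under the conventions described on this page, not the command-line parser itself, and parse_rerun is a hypothetical name.

```python
def parse_rerun(root, rerun):
    """Split a --rerun value into (input repository, output repository) paths."""
    if ":" in rerun:
        # "A:B": read from rerun A, write to rerun B.
        input_name, output_name = rerun.split(":", 1)
        return f"{root}/rerun/{input_name}", f"{root}/rerun/{output_name}"
    # "A": read from the root repository, write to rerun A.
    return root, f"{root}/rerun/{rerun}"


print(parse_rerun("DATA", "A"))    # ('DATA', 'DATA/rerun/A')
print(parse_rerun("DATA", "A:B"))  # ('DATA/rerun/A', 'DATA/rerun/B')
```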

How to write outputs to an existing rerun

Tasks can write to an existing rerun. For example, if rerun B was already created, you can write additional outputs to it:

task3.py DATA --rerun B ...

Because reruns are chained, the Butler will start looking for datasets in this rerun B, then in the chained A rerun, all the way to the root data repository (DATA). This chaining is transparent to you. You don’t need to know which repository in the chain a given input dataset comes from. All you need to know is the root data repository and the terminal rerun’s name.
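The lookup order can be pictured with a toy model in which the first repository in the chain that holds a dataset wins. The repository contents and dataset names below are made up for illustration, and find_dataset is a hypothetical helper rather than a Butler method.

```python
# Toy contents: repository name -> dataset names stored directly in it.
contents = {
    "DATA": {"raw"},
    "A": {"calexp"},
    "B": {"coadd"},
}
# Search order for rerun B: itself, then rerun A, then the root repository.
chain = ["B", "A", "DATA"]


def find_dataset(name):
    """Return the first repository in the chain that holds the dataset, or None."""
    for repo in chain:
        if name in contents[repo]:
            return repo
    return None


print(find_dataset("coadd"))   # B
print(find_dataset("calexp"))  # A
print(find_dataset("raw"))     # DATA
```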

When reusing a rerun for multiple runs of the same command-line task, be aware of configuration consistency checks. See Working with provenance checks in command-line tasks for more information.

When to create a new rerun

When using multiple command-line tasks to process data, you have the option of re-using the same rerun or creating a new chained rerun for each successive task. How you use reruns is up to you.

Reruns are useful for creating processing checkpoints (hence their name). You can run the same task with different configurations, writing the output of each to a different rerun. By analyzing and comparing equivalent datasets in each rerun, you can make informed decisions about task configuration.

If you don’t use separate reruns, a task will report an error when it is run with a different configuration than it previously used in the same rerun. These checks are in place to ensure that the provenance of data processing is traceable. See Working with provenance checks in command-line tasks for more information.