Overview

LSST Batch Processing Service (BPS) allows large-scale workflows to execute in a well-managed fashion, potentially in multiple environments. The service is provided by the ctrl_bps package. ctrl_bps_htcondor is a plugin that allows ctrl_bps to execute workflows on computational resources managed by HTCondor.

Prerequisites

  1. ctrl_bps, the package providing BPS.

  2. An HTCondor cluster.

  3. HTCondor’s Python bindings.

Installing the plugin

Starting from LSST Stack version w_2022_18, the HTCondor plugin package for Batch Processing Service, ctrl_bps_htcondor, comes with lsst_distrib. However, if you’d like to try out its latest features, you may install a bleeding-edge version in the same way as any other LSST package:

git clone https://github.com/lsst/ctrl_bps_htcondor
cd ctrl_bps_htcondor
setup -k -r .
scons

Specifying the plugin

The class providing HTCondor support for ctrl_bps is

lsst.ctrl.bps.htcondor.HTCondorService

Inform ctrl_bps about its location using one of the methods described in its documentation.

Defining a submission

BPS configuration files are YAML files with some reserved keywords and some special features. See BPS configuration file for details.

The plugin supports all settings described in the ctrl_bps documentation except preemptible.

HTCondor is able to send jobs to run on a remote compute site, even when that compute site is running a non-HTCondor system, by sending “pilot jobs”, or glideins, to remote batch systems.

Nodes for HTCondor’s glideins can be allocated with the help of ctrl_execute. Once you have allocated the nodes, you can specify the site where they are available in your BPS configuration file. For example:

site:
  acsws02:
    profile:
      condor:
        requirements: '(ALLOCATED_NODE_SET == "${NODESET}")'
        +JOB_NODE_SET: '"${NODESET}"'

Note

The ctrl_execute package is not part of the lsst_distrib metapackage, so it and its dependencies need to be installed manually.

Submitting a run

See bps submit.

Checking status

See bps report.

To make the summary report (bps report) faster, the plugin uses summary information available with the DAGMan job. For a running DAG, this status can lag behind by a few minutes. Also, DAGMan tracks the deletion of individual jobs as failures (there are no separate counts for deleted jobs), so the flag column of the summary report will show F when there are either failed or deleted jobs. For a detailed report (bps report --id <id>), the plugin reads detailed job information from files. The detailed report can therefore distinguish between failed and deleted jobs, and will show D in the flag column for a running workflow if there is a deleted job.
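
The difference between the two reports can be illustrated with a small Python sketch. This is not the plugin's actual code: the function names, the "failed" counter key, and the per-job state strings are all hypothetical, and only the F and D flags mentioned above are modeled. Checking for deleted jobs before failed ones is also an assumption made for the illustration.

```python
def summary_flag(dagman_counts):
    """Flag for the summary report, derived from DAGMan-style counts.

    DAGMan counts deleted jobs as failures, so only a single
    "failed" counter (hypothetical key name) is available here.
    """
    return "F" if dagman_counts.get("failed", 0) > 0 else ""


def detailed_flag(job_states):
    """Flag for the detailed report, derived from per-job states.

    The per-job states (hypothetical strings) keep deletions apart
    from failures, so "D" can be reported separately. The order of
    the two checks is an assumption, not documented behavior.
    """
    if "deleted" in job_states:
        return "D"
    if "failed" in job_states:
        return "F"
    return ""
```

In this model a deleted job increments the "failed" counter seen by summary_flag, while detailed_flag sees the job's own "deleted" state.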

Occasionally, some jobs are put on hold by HTCondor. To see the reason why jobs are being held, use

condor_q -hold <id>    # to see a specific job being held
condor_q -hold <user>  # to see all held jobs owned by the user

Canceling submitted jobs

See bps cancel.

If jobs are hanging around in the queue with an X status in the report displayed by bps report, you can add the following to the bps cancel command to force-delete those jobs from the queue:

--pass-thru "-forcex"

Restarting a failed run

See bps restart.

A valid run id is one of the following:

  • job id, e.g., 1234.0 (using just the cluster id, 1234, will also work),

  • global job id (e.g., sdfrome002.sdf.slac.stanford.edu#165725.0#1699393748),

  • run’s submit directory (e.g., /sdf/home/m/mxk/lsst/bps/submit/u/mxk/pipelines_check/20230713T135346Z).
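
The three forms listed above can be told apart by their shape. The following Python sketch shows one way to do so; it is a hypothetical helper, not part of ctrl_bps or the plugin, and the patterns are guesses based only on the examples given here. In particular, treating any other string that contains a path separator as a submit directory is an assumption.

```python
import re

# Illustrative patterns inferred from the examples above.
JOB_ID = re.compile(r"^\d+(\.\d+)?$")                  # 1234 or 1234.0
GLOBAL_JOB_ID = re.compile(r"^[^#\s]+#\d+\.\d+#\d+$")  # host#cluster.proc#number


def classify_run_id(run_id):
    """Return a label describing the kind of run id (hypothetical helper)."""
    if JOB_ID.match(run_id):
        return "job id"
    if GLOBAL_JOB_ID.match(run_id):
        return "global job id"
    if "/" in run_id:
        # Assumption: anything path-like is a submit directory.
        return "submit directory"
    return "unknown"
```

Note that the sketch only inspects the string; it does not check that the job or the directory actually exists.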

Note

If you don’t remember any of the run’s ids, you may try running

bps report --username <username> --hist <n>

where <username> and <n> are respectively your user account and the number of past days you would like to include in your search. Keep in mind though that availability of the historical records depends on the HTCondor configuration and the load of the computational resource in use. Consequently, you may still get no results, in which case using the submit directory remains your only option.

When execution of a workflow is managed by HTCondor, BPS is able to instruct it to automatically retry jobs that failed because they exceeded their memory allocation, using increased memory requirements (see the documentation of the memoryMultiplier option for more details). However, these increased memory requirements are not preserved between restarts. For example, if a job initially ran with 2 GB of memory and failed because it exceeded the limit, HTCondor will retry it with 4 GB of memory. However, if the job, and as a result the entire workflow, fails again for other reasons, the job will ask for 2 GB of memory during the first execution after the workflow is restarted.
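
The 2 GB and 4 GB example above can be reproduced with a short Python sketch. It is a simplified model, not the plugin's implementation: the function name is hypothetical, and both the multiplier of 2 and the rule that each retry multiplies the previous request are assumptions inferred from the example.

```python
def memory_for_attempt(request_memory_gb, multiplier, attempt):
    """Memory requested on a given attempt (0 = first execution).

    Simplified model (an assumption): every automatic retry
    multiplies the previous request by the multiplier.
    """
    return request_memory_gb * multiplier ** attempt


# Original submission: the first execution, then one automatic retry.
first_run = [memory_for_attempt(2, 2, n) for n in range(2)]

# After a restart the attempt counter starts over in this model,
# so the job asks for the original amount of memory again.
after_restart = memory_for_attempt(2, 2, 0)
```

Here first_run holds the requests for the first execution and the retry, and after_restart is the request made by the first execution of the restarted workflow.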

Troubleshooting

Where is stdout/stderr from pipeline tasks?

For now, stdout/stderr can be found in files in the run submit directory.

Why did my submission fail?

Check the *.dag.dagman.out file in the run submit directory for errors, in particular for ERROR: submit attempt failed.
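
The check can be scripted. The Python sketch below searches every *.dag.dagman.out file in a submit directory for the message quoted above. The function name and its parameters are hypothetical, and it assumes the log files are plain text located directly in the submit directory.

```python
from pathlib import Path


def find_submit_errors(submit_dir, marker="ERROR: submit attempt failed"):
    """Return (file name, line number, text) for each line containing marker."""
    hits = []
    for log in sorted(Path(submit_dir).glob("*.dag.dagman.out")):
        # Replace undecodable bytes rather than failing on them.
        with log.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((log.name, number, line.rstrip()))
    return hits
```

Passing marker="ERROR" instead widens the search to every line that contains that word.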