Overview¶
LSST Batch Processing Service (BPS) allows large-scale workflows to execute in
a well-managed fashion, potentially in multiple environments. The service is
provided by the ctrl_bps package. ctrl_bps_htcondor is a plugin allowing
ctrl_bps to execute workflows on computational resources managed by HTCondor.
Prerequisites¶
Installing the plugin¶
Starting from LSST Stack version w_2022_18, the HTCondor plugin package for
Batch Processing Service, ctrl_bps_htcondor, comes with lsst_distrib.
However, if you’d like to try out its latest features, you may install a
bleeding edge version similarly to any other LSST package:
git clone https://github.com/lsst/ctrl_bps_htcondor
cd ctrl_bps_htcondor
setup -k -r .
scons
Specifying the plugin¶
The class providing HTCondor support for ctrl_bps is lsst.ctrl.bps.htcondor.HTCondorService.
Inform ctrl_bps about its location using one of the methods described in its documentation.
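As one hedged illustration, the plugin class can be selected through the environment before invoking any bps command. This sketch assumes the BPS_WMS_SERVICE_CLASS environment variable that ctrl_bps consults; check the ctrl_bps documentation for the methods supported by your version (a wmsServiceClass entry in the BPS configuration file is another commonly used route).

```shell
# Assumption: ctrl_bps reads the plugin class from BPS_WMS_SERVICE_CLASS.
# Verify this against the ctrl_bps documentation for your Stack version.
export BPS_WMS_SERVICE_CLASS=lsst.ctrl.bps.htcondor.HTCondorService

# Confirm what subsequent bps commands in this shell will see.
echo "BPS will use: ${BPS_WMS_SERVICE_CLASS}"
```

Setting the variable affects only the current shell session and its child processes, so it is a convenient way to try the plugin without editing a configuration file.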
Defining a submission¶
BPS configuration files are YAML files with some reserved keywords and some special features. See BPS configuration file for details.
The plugin supports all settings described in the ctrl_bps documentation except preemptible.
HTCondor is able to send jobs to run on a remote compute site, even when that compute site is running a non-HTCondor system, by sending “pilot jobs”, or glideins, to remote batch systems.
Nodes for HTCondor’s glideins can be allocated with the help of ctrl_execute. Once you have allocated the nodes, you can specify the site where they are available in your BPS configuration file. For example:
site:
  acsws02:
    profile:
      condor:
        requirements: '(ALLOCATED_NODE_SET == "${NODESET}")'
        +JOB_NODE_SET: '"${NODESET}"'
Note
The ctrl_execute package is not part of the lsst_distrib metapackage; it and its dependencies need to be installed manually.
Submitting a run¶
See bps submit.
Checking status¶
See bps report.
To make the summary report (bps report) faster, the plugin uses summary
information available with the DAGMan job. For a running DAG, this status
can lag behind by a few minutes. Also, DAGMan tracks deletion of individual
jobs as failures (there are no separate counts for deleted jobs), so the
flag column of the summary report will show F when there are either failed
or deleted jobs. For a detailed report (bps report --id <id>), the plugin
reads detailed job information from files. The detailed report can therefore
distinguish between failed and deleted jobs, and will show D in the flag
column for a running workflow if there is a deleted job.
Occasionally, some jobs are put on hold by HTCondor. To see why jobs are being held, use:
condor_q -hold <id> # to see a specific job being held
condor_q -hold <user> # to see all held jobs owned by the user
Canceling submitted jobs¶
See bps cancel.
If jobs linger in the queue with an X status in the report displayed by
bps report, you can add the following option to your bps cancel command to
force their removal from the queue:
--pass-thru "-forcex"
Restarting a failed run¶
See bps restart.
A valid run id is one of the following:
job id, e.g., 1234.0 (using just the cluster id, 1234, will also work),
global job id (e.g., sdfrome002.sdf.slac.stanford.edu#165725.0#1699393748),
run’s submit directory (e.g., /sdf/home/m/mxk/lsst/bps/submit/u/mxk/pipelines_check/20230713T135346Z).
Note
If you don’t remember any of the run’s ids, you may try running
bps report --username <username> --hist <n>
where <username> and <n> are, respectively, your user account and the
number of past days you would like to include in your search. Keep in mind,
though, that the availability of historical records depends on the HTCondor
configuration and the load of the computational resource in use.
Consequently, you may still get no results, in which case using the submit
directory remains your only option.
When execution of a workflow is managed by HTCondor, BPS is able to
instruct it to automatically retry jobs that failed due to exceeding their
memory allocation, with increased memory requirements (see the documentation
of the memoryMultiplier option for more details). However, these increased
memory requirements are not preserved between restarts. For example, if a job
initially ran with 2 GB of memory and failed because it exceeded the limit,
HTCondor will retry it with 4 GB of memory. However, if the job, and as a
result the entire workflow, fails again for other reasons, the job will ask
for 2 GB of memory during the first execution after the workflow is restarted.
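The arithmetic in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function memory_for_attempt is hypothetical and not part of BPS or HTCondor, it assumes the request grows by the multiplier on every memory-related retry, and it uses an integer multiplier of 2 to match the 2 GB to 4 GB example (the actual expression used by the plugin, including any upper limit, may differ).

```shell
# Hypothetical helper: memory (in MB) requested on a given attempt.
#   $1  initial request in MB (plays the role of the job's memory request)
#   $2  integer multiplier (plays the role of memoryMultiplier)
#   $3  number of memory-related retries so far (0 = first execution)
# Runs in a subshell so its variables do not leak into the caller.
memory_for_attempt() (
    mem=$1
    multiplier=$2
    attempt=$3
    i=0
    while [ "$i" -lt "$attempt" ]; do
        mem=$((mem * multiplier))
        i=$((i + 1))
    done
    echo "$mem"
)

memory_for_attempt 2048 2 0   # first execution
memory_for_attempt 2048 2 1   # retry after exceeding the memory limit
memory_for_attempt 2048 2 0   # first execution after a restart: counter starts over
```

The last call reflects the point made above: a restarted workflow does not remember earlier retries, so the job begins again from its original request.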
Troubleshooting¶
Where is stdout/stderr from pipeline tasks?¶
For now, stdout/stderr can be found in files in the run submit directory.
Why did my submission fail?¶
Check the *.dag.dagman.out file in the run’s submit directory for errors, in
particular for ERROR: submit attempt failed.
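A quick way to run that check is to search the DAGMan log with grep. The sketch below wraps the search in a small function; find_submit_errors is a hypothetical name introduced here for illustration, and its argument is the path to the run's submit directory.

```shell
# Hypothetical helper: print, with line numbers, every occurrence of the
# submit failure message in the DAGMan output file(s) of a submit directory.
#   $1  path to the run's submit directory
find_submit_errors() {
    grep -n "ERROR: submit attempt failed" "$1"/*.dag.dagman.out
}

# Usage, with the submit directory of your run substituted in:
#   find_submit_errors /path/to/submit/directory
```

grep exits with a non-zero status when nothing matches, so an empty result means this particular message is absent; the file may still contain other errors worth reading.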