Overview¶
The LSST Batch Processing Service (BPS) allows large-scale workflows to execute in a well-managed fashion, potentially in multiple environments. The service is provided by the ctrl_bps package. ctrl_bps_htcondor is a plugin allowing ctrl_bps to execute workflows on computational resources managed by HTCondor.
Prerequisites¶
Installing the plugin¶
Starting from LSST Stack version w_2022_18, the HTCondor plugin package for the Batch Processing Service, ctrl_bps_htcondor, comes with lsst_distrib. However, if you’d like to try out its latest features, you may install a bleeding-edge version similarly to any other LSST package:
git clone https://github.com/lsst/ctrl_bps_htcondor
cd ctrl_bps_htcondor
setup -k -r .
scons
Specifying the plugin¶
The class providing HTCondor support for ctrl_bps is lsst.ctrl.bps.htcondor.HTCondorService. Inform ctrl_bps about its location using one of the methods described in its documentation.
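For example, one common way to do this (a sketch assuming the standard wmsServiceClass setting described in the ctrl_bps documentation) is to add the class name to your submit YAML:
wmsServiceClass: lsst.ctrl.bps.htcondor.HTCondorService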
Defining a submission¶
BPS configuration files are YAML files with some reserved keywords and some special features. See BPS configuration file for details.
The plugin supports all settings described in the ctrl_bps documentation except preemptible.
Job Ordering¶
This plugin supports both ordering types: group and noop. Job outputs are still underneath the jobs subdirectory. If one is looking at HTCondor information directly:

- group ordering is implemented as subdags, so you will see more DAGMan jobs in the queue as well as a new subdags subdirectory for the internal files for running a group. To enable running other subdags after a failure while still pruning downstream jobs, another job, whose name starts with wms_check_status, runs after the subdag to check for a failure and trigger the pruning.
- noop ordering is implemented directly as DAGMan NOOP jobs. These jobs do not actually do anything, but provide a mechanism for telling HTCondor about more job dependencies without using a large number (all-to-all) of dependencies.
Job Environment¶
By default, the HTCondor jobs copy the environment from the shell in which bps submit was executed. To set or override an environment variable via the submission YAML, use an environment section. Other YAML values and pre-existing environment variables can be used. Some examples:
environment:
  one: 1
  two: "2"
  three: "spacey 'quoted' value"
  MYPATH: "${CTRL_BPS_DIR}/tests"
  DAF_BUTLER_CACHE_DIRECTORY: "/tmp/mgower/daf_cache/{run_number}"
Note
The environment section has to be at the root level. There is no way to change the environment inside another level (e.g., per site, per cluster, per pipeline task).
Glideins¶
HTCondor is able to send jobs to run on a remote compute site, even when that compute site is running a non-HTCondor system, by sending “pilot jobs”, or glideins, to remote batch systems.
Nodes for HTCondor’s glideins can be allocated with the help of ctrl_execute. Once you have allocated the nodes, you can specify the site where they are available in your BPS configuration file. For example:
site:
  acsws02:
    profile:
      condor:
        requirements: '(ALLOCATED_NODE_SET == "${NODESET}")'
        +JOB_NODE_SET: '"${NODESET}"'
Submitting a run¶
See bps submit.
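For example, a minimal submission (a sketch assuming a BPS configuration file named bps_config.yaml in the current directory) looks like:
bps submit bps_config.yaml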
Checking status¶
See bps status.
The plugin can take either the HTCondor ID (as shown in bps report or condor_q) or the submit path.
For not-completed workflows, the speed of using the ID can depend on whether the command runs on the same submit machine (i.e., local schedd) or not and on how busy the schedd machines are. For completed workflows, using the ID may not work if the HTCondor logs have rolled over between the time of completion and the time of the status command.
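For example (a sketch assuming bps status accepts the same --id option as bps report; both ID forms are shown):
bps status --id 1234.0          # using the HTCondor ID
bps status --id <submit dir>    # using the run submit directory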
Printing a report¶
See bps report.
In order to make the summary report (bps report) faster, the plugin uses summary information available with the DAGMan job. For a running DAG, this status can lag behind by a few minutes. Also, DAGMan tracks deletion of individual jobs as failures (there are no separate counts for deleted jobs), so the summary report flag column will show F when there are either failed or deleted jobs. If getting a detailed report (bps report --id <ID>), the plugin reads detailed job information from files, so the detailed report can distinguish between failed and deleted jobs and thus will show D in the flag column for a running workflow if there is a deleted job.
Rarely, a detailed report may warn about job submission issues. For example:
Warn: Job submission issues (last: 01/30/25 10:36:57)
A job submission issue may or may not be intermittent. Either way, it may cause problems with the status or counts in the reports. To get more information about the submission issue, look in the *.dag.dagman.out file for errors, in particular lines containing submit attempt failed.
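For example, a quick way to find such errors (a sketch assuming a POSIX shell with grep, where <submit dir> is the run submit directory):
grep "submit attempt failed" <submit dir>/*.dag.dagman.out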
Occasionally, some jobs are put on hold by HTCondor. To see the reason why jobs are being held, use
condor_q -hold <ID> # to see a specific job being held
condor_q -hold <user> # to see all held jobs owned by the user
Canceling submitted jobs¶
See bps cancel.
If jobs are hanging around in the queue with an X status in the report displayed by bps report, you can add the following to force delete those jobs from the queue:
--pass-thru "-forcex"
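For example (a sketch assuming the run ID is passed via the usual --id option of bps cancel):
bps cancel --id <ID> --pass-thru "-forcex"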
Restarting a failed run¶
See bps restart.
A valid run ID is one of the following:

- job ID, e.g., 1234.0 (using just the cluster ID, 1234, will also work),
- global job ID, e.g., sdfrome002.sdf.slac.stanford.edu#165725.0#1699393748,
- run’s submit directory, e.g., /sdf/home/m/mxk/lsst/bps/submit/u/mxk/pipelines_check/20230713T135346Z.
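For example, to restart a run using its submit directory (a minimal sketch; any of the ID forms above may be used):
bps restart --id <submit dir>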
Note
If you don’t remember any of the run’s IDs, you may try running
bps report --username <username> --hist <n>
where <username> and <n> are respectively your user account and the number of past days you would like to include in your search. Keep in mind though that availability of the historical records depends on the HTCondor configuration and the load of the computational resource in use. Consequently, you may still get no results and using the submit directory remains your only option.
When execution of a workflow is managed by HTCondor, BPS is able to instruct it to automatically retry jobs which failed due to exceeding their memory allocation with increased memory requirements (see the documentation of the memoryMultiplier option for more details). However, these increased memory requirements are not preserved between restarts. For example, if a job initially ran with 2 GB of memory and failed because of exceeding the limit, HTCondor will retry it with 4 GB of memory. However, if the job, and as a result the entire workflow, fails again due to other reasons, the job will ask for 2 GB of memory during the first execution after the workflow is restarted.
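A sketch of the submit YAML settings involved (the values are illustrative; see the ctrl_bps documentation for the authoritative description of these options):
requestMemory: 2048      # MB requested on the first attempt
memoryMultiplier: 2.0    # request is scaled up after an out-of-memory failure
numberOfRetries: 3       # maximum number of automatic retries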
Provisioning resources automatically¶
Computational resources required to execute a workflow may not always be managed directly by HTCondor and may need to be provisioned first by a different workload manager, for example, Slurm. In such a case, ctrl_bps_htcondor can be instructed to run a provisioning job alongside the workflow which will first create and then maintain the glideins necessary for the execution of the workflow.
This provisioning job is called provisioning_job.bash and is managed by HTCondor. Be careful not to remove it by accident when using the condor_rm or kill command. The job is run on a best-effort basis and will not be automatically restarted once deleted.
To enable automatic provisioning of the resources, add the following settings to your BPS configuration:
provisionResources: true
provisioning:
  provisioningMaxWallTime: <value>
where <value> is the approximate time your workflow needs to complete, e.g., 3600, 10:00:00.
This will instruct ctrl_bps_htcondor to include a service job that runs alongside the other payload jobs in the workflow and automatically creates and maintains the glideins required for the payload jobs to run.
If you enable automatic provisioning of resources, you will see the status of the provisioning job in the output of the bps report --id <ID> command. Look for the line starting with “Provisioning job status”. For example:
X STATE %S ID OPERATOR PROJECT CAMPAIGN PAYLOAD RUN
--- ------- --- ----- -------- ------- -------- ------- ---------------------------------------
RUNNING 0 1.0 jdoe dev quick pcheck u_jdoe_pipelines_check_20240924T201447Z
Path: /home/jdoe/submit/u/jdoe/pipelines_check/20240924T201447Z
Global job id: node001#1.0#1727208891
Provisioning job status: RUNNING
UNKNOWN MISFIT UNREADY READY PENDING RUNNING DELETED HELD SUCCEEDED FAILED PRUNED EXPECTED
----------------- ------- ------ ------- ----- ------- ------- ------- ---- --------- ------ ------ --------
TOTAL 0 0 4 0 1 0 0 0 0 0 0 5
----------------- ------- ------ ------- ----- ------- ------- ------- ---- --------- ------ ------ --------
pipetaskInit 0 0 0 0 1 0 0 0 0 0 0 1
isr 0 0 1 0 0 0 0 0 0 0 0 1
characterizeImage 0 0 1 0 0 0 0 0 0 0 0 1
calibrate 0 0 1 0 0 0 0 0 0 0 0 1
finalJob 0 0 1 0 0 0 0 0 0 0 0 1
If the provisioning job status is UNREADY, check the end of the report to see if there is a warning about submission issues. There may be a temporary problem.
Check the *.dag.dagman.out file in the run submit directory for errors, in particular for ERROR: submit attempt failed.
If the provisioning job status is HELD, the hold reason will appear in parentheses.
The service job managing the glideins will be automatically canceled once the workflow is completed. However, the existing glideins will be left for HTCondor to shut down once they remain inactive for the period specified by provisioningMaxIdleTime (default value: 15 min., see below) or the maximum wall time is reached.
The provisioning job is expected to run as long as the workflow. If the job dies, the job status will be FAILED. If the job just completed successfully, the job status will be SUCCEEDED with a message saying it ended early (which may or may not cause a problem since existing glideins could remain running). To get more information about either of these cases, check the job output and error files in the jobs/provisioningJob subdirectory.
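For example (a sketch; the exact file names depend on the submission):
ls <submit dir>/jobs/provisioningJob/
tail <submit dir>/jobs/provisioningJob/*.out <submit dir>/jobs/provisioningJob/*.err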
If the automatic provisioning of the resources is enabled, the script that the service job is supposed to run in order to provide the required resources must be defined by the provisioningScript setting in the provisioning section of your BPS configuration file. By default, ctrl_bps_htcondor will use allocateNodes.py from the ctrl_execute package with the following settings:
provisioning:
  provisioningNodeCount: 10
  provisioningMaxIdleTime: 900
  provisioningCheckInterval: 600
  provisioningQueue: "milano"
  provisioningAccountingUser: "rubin:developers"
  provisioningExtraOptions: ""
  provisioningPlatform: "s3df"
  provisioningScript: |
    #!/bin/bash
    set -e
    set -x
    while true; do
      ${CTRL_EXECUTE_DIR}/bin/allocateNodes.py \
        --account {provisioningAccountingUser} \
        --auto \
        --node-count {provisioningNodeCount} \
        --maximum-wall-clock {provisioningMaxWallTime} \
        --glidein-shutdown {provisioningMaxIdleTime} \
        --queue {provisioningQueue} \
        {provisioningExtraOptions} \
        {provisioningPlatform}
      sleep {provisioningCheckInterval}
    done
    exit 0
allocateNodes.py requires a small configuration file located in the user’s directory to work. With automatic provisioning enabled, ctrl_bps_htcondor will create a new file if it does not exist at the location defined by provisioningScriptConfigPath using the template defined by the provisioningScriptConfig setting in the provisioning section:
provisioning:
  provisioningScriptConfig: |
    config.platform["{provisioningPlatform}"].user.name="${USER}"
    config.platform["{provisioningPlatform}"].user.home="${HOME}"
  provisioningScriptConfigPath: "${HOME}/.lsst/condor-info.py"
If you’re using a custom provisioning script that does not require any external configuration, set provisioningScriptConfig to an empty string.
If the file already exists, it will be used as is (BPS will not update it with config settings). If you wish BPS to overwrite the file with the provisioningScriptConfig values, you need to manually remove or rename the existing file.
Note
${CTRL_BPS_HTCONDOR_DIR}/python/lsst/ctrl/bps/htcondor/etc/htcondor_defaults.yaml contains default values used by every bps submission when using the ctrl_bps_htcondor plugin; these defaults are automatically included in your submission configuration.
Releasing held jobs¶
Occasionally, when HTCondor encounters issues during a job’s execution, it places the job in the hold state. You can see which of your submitted jobs are currently being held, and why, by using the command:
condor_q -held
If any of your jobs are being held, it will display something similar to:
-- Schedd: sdfrome002.sdf.slac.stanford.edu : <172.24.33.226:21305?... @ 10/02/24 10:59:41
ID OWNER HELD_SINCE HOLD_REASON
5485584.0 jdoe 9/23 11:04 Error from slot_jdoe_8693_1_1@sdfrome051.sdf.slac.stanford.edu: Failed to execute '/sdf/group/rubin/sw/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_mpexec/g1ce94f1343+74d41caebd/bin/pipetask' with arguments --long-log --log-level=VERBOSE run-qbb /repo/ops-rehearsal-3-prep /sdf/home/j/jdoe/u/pipelines/submit/u/jdoe/DM-43059/step3/20240301T190055Z/u_jdoe_step3_20240301T190055Z.qgraph --qgraph-node-id 6b5daf05-10fc-462e-82e0-cc618be83a12: (errno=2: 'No such file or directory')
5471792.0 jdoe 7/10 08:27 File '/sdf/group/rubin/sw/conda/envs/lsst-scipipe-8.0.0/bin/condor_dagman' is missing or not executable
7636239.0 jdoe 3/20 01:32 Job raised a signal 11. Handling signal as if job has gone over memory limit.
5497548.0 jdoe 3/6 00:14 Job raised a signal 9. Handling signal as if job has gone over memory limit.
12863358.0 jdoe 6/27 11:05 Error from slot_jdoe_32400_1_1@sdfrome009.sdf.slac.stanford.edu: Failed to open '/sdf/data/rubin/shared/jdoe/simulation/output/output.0' as standard output: No such file or directory (errno 2)
20590593.0 jdoe 6/23 13:03 Transfer output files failure at the execution point while sending files to access point sdfrome001. Details: reading from file /lscratch/jdoe/execute/dir_1460253/_condor_stdout: (errno 2) No such file or directory
12033406.0 jdoe 5/13 10:48 Cannot access initial working directory /sdf/data/rubin/user/jdoe/repo-main-logs/submit/u/jdoe/20240311T231829Z: No such file or directory
Note
If you would like to display held jobs that were submitted for execution by other users, use condor_q -held <username> instead, where <username> is the user account whose held jobs you would like to check. See the condor_q man page for other supported options.
A job that is in the hold state can be released from it with condor_release, provided the issue that made HTCondor put it in this state has been resolved. For example, if your job with ID 1234.0 was placed in the hold state because during execution it exceeded the 2048 MiB you requested for it at submission, you can double the amount of memory it should request with
condor_qedit 1234.0 RequestMemory=4096
and then release it from the hold state with
condor_release 1234.0
When the job is released from the hold state, HTCondor puts it into the IDLE state and will rerun it using the exact same command and environment as before.
Note
Placing jobs in the hold state due to missing files or directories usually happens when the glideins expire or there are some filesystem issues. After creating new glideins with allocateNodes.py (see Provisioning resources automatically for future submissions) or after the filesystem issues have been resolved, it should typically be safe to release the jobs from the hold state.
If multiple jobs were placed by HTCondor in the hold state and you only want to deal with a subset of the currently held jobs, use the -constraint <expression> option that both condor_qedit and condor_release support, where <expression> can be an arbitrarily complex HTCondor ClassAd expression. For example
condor_qedit -constraint "JobStatus == 5 && HoldReasonCode == 3 && HoldReasonSubCode == 34" RequestMemory=4096
condor_release -constraint "JobStatus == 5 && HoldReasonCode == 3 && HoldReasonSubCode == 34"
will only affect jobs that were placed in the hold state (JobStatus is 5) for a specific reason, here, that the memory usage exceeded memory limits (HoldReasonCode is 3 and HoldReasonSubCode is 34).
Note
By default, BPS will automatically retry jobs that failed due to out-of-memory errors (see the Automatic memory scaling section in the ctrl_bps documentation for more information regarding this topic), so the issues illustrated by the above examples should only occur if automatic memory scaling was explicitly disabled in the submit YAML file.
Automatic Releasing of Held Jobs¶
Many times, releasing the jobs to just try again is successful because the system issues are transient. releaseExpr can be set in the submit YAML to add automatic release conditions. Like other BPS config values, this can be set globally or for a specific cluster or pipetask. The number of retries is still limited by numberOfRetries; all held jobs count towards this limit no matter what the reason. The plugin prohibits the automatic release of jobs held by a user.
Example expressions:

- releaseExpr: "True" - will always release a held job unless it was held by a user.
- releaseExpr: "HoldReasonCode =?= 7" - will release jobs where the standard output file for the job could not be opened.
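A sketch of how these settings might appear in a submit YAML, both globally and for a specific pipetask label (the label calibrate and the values shown are illustrative):
numberOfRetries: 3
releaseExpr: "HoldReasonCode =?= 7"
pipetask:
  calibrate:
    releaseExpr: "True"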
For more information about expressions, see the HTCondor documentation:

- HTCondor ClassAd expressions
- list of HoldReasonCodes
Warning
System problems should still be tracked and reported. All of the hold reasons for a single completed run can be found via grep -A 2 held <submit dir>/*.nodes.log.
Troubleshooting¶
Where is stdout/stderr from pipeline tasks?¶
For now, stdout/stderr can be found in files in the run submit directory after the job is done. Python logging goes to stderr, so the majority of the pipetask output will be in the *.err file. One exception is finalJob, which does print some information to stdout (the *.out file).
While the job is running, the owner of the job can use the condor_tail command to peek at the stdout/stderr of a job. bps uses the ID for the entire workflow, but for the HTCondor command condor_tail you will need the ID for the individual job. Run the following command and look for the ID for the job (undefined values are normal and typically correspond to the DAGMan jobs):
condor_q -run -nobatch -af:hj bps_job_name bps_run
Once you have the HTCondor ID for the particular job whose output you want to peek at, run this command:
condor_tail -stderr -f <ID>
If you want to instead see the stdout, leave off the -stderr. If you need to see more of the contents, specify -maxbytes <numbytes> (defaults to 1024 bytes).
I need to look around on the compute node where my job is running.¶
If using glideins, you might be able to just ssh to the compute node from the submit node. First, you need to find out on which node the job is running:
condor_q -run -nobatch -af:hj RemoteHost bps_job_name bps_run
Alternatively, HTCondor has the command condor_ssh_to_job, where you just need the job ID. This is not the workflow ID (the ID that bps commands use), but an individual job ID. The command above also prints the job IDs.
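For example, once you have the individual job ID (the ID 1234.0 here is illustrative):
condor_ssh_to_job 1234.0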
Why did my submission fail?¶
Check the *.dag.dagman.out file in the run submit directory for errors, in particular for ERROR: submit attempt failed.
I enabled automatic provisioning, but my jobs still sit idle in the queue!¶
The service node responsible for executing the provisioning script runs on a best-effort basis. If this node fails to submit correctly or crashes during the workflow execution, this will not register as an error and the workflow will continue normally until the existing glideins expire. As a result, payload jobs may get stuck in the job queue if the glideins were not created or expired before the execution of the workflow could be completed.
Firstly, use bps report --id <run ID> to display the run report and look for the line
Provisioning job status: <status>
If the <status> is different from RUNNING, it means that the automatic provisioning is not working. In such a case, create glideins manually to complete your run.