Overview

The LSST Batch Processing Service (BPS) allows large-scale workflows to execute in a well-managed fashion, potentially in multiple environments. The service is provided by the ctrl_bps package. ctrl_bps_htcondor is a plugin allowing ctrl_bps to execute workflows on computational resources managed by HTCondor.

Prerequisites

  1. ctrl_bps, the package providing BPS.

  2. HTCondor cluster.

  3. HTCondor’s Python bindings.

Installing the plugin

Starting from LSST Stack version w_2022_18, the HTCondor plugin package for the Batch Processing Service, ctrl_bps_htcondor, comes with lsst_distrib. However, if you’d like to try out its latest features, you may install a bleeding-edge version in the same way as any other LSST package:

git clone https://github.com/lsst/ctrl_bps_htcondor
cd ctrl_bps_htcondor
setup -k -r .
scons

Specifying the plugin

The class providing HTCondor support for ctrl_bps is

lsst.ctrl.bps.htcondor.HTCondorService

Inform ctrl_bps about its location using one of the methods described in its documentation.
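One of those methods is to set wmsServiceClass in your submit YAML; see the ctrl_bps documentation for the other options. For example:

wmsServiceClass: lsst.ctrl.bps.htcondor.HTCondorService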

Defining a submission

BPS configuration files are YAML files with some reserved keywords and some special features. See BPS configuration file for details.

The plugin supports all settings described in the ctrl_bps documentation except preemptible.

Job Ordering

This plugin supports both the group and noop ordering types. Job outputs are still written underneath the jobs subdirectory.

If one is looking at HTCondor information directly:

  • group ordering is implemented as subdags, so you will see more DAGMan jobs in the queue as well as a new subdags subdirectory containing the internal files for running a group. To allow other subdags to run after a failure while still pruning downstream jobs, an additional job, whose name starts with wms_check_status, runs after each subdag to check for a failure and trigger the pruning.

  • noop ordering is directly implemented as DAGMan NOOP jobs. These jobs do not actually do anything, but provide a mechanism for telling HTCondor about more job dependencies without using a large number (all-to-all) of dependencies.

Job Environment

By default, the HTCondor jobs copy the environment from the shell in which bps submit was executed. To set or override an environment variable via the submission YAML, use an environment section. Other YAML values and pre-existing environment variables can be used. Some examples:

environment:
  one: 1
  two: "2"
  three: "spacey 'quoted' value"
  MYPATH: "${CTRL_BPS_DIR}/tests"
  DAF_BUTLER_CACHE_DIRECTORY: "/tmp/mgower/daf_cache/{run_number}"

Note

The environment section has to be at the root level. There is no way to change the environment inside another level (e.g., per site, per cluster, or per pipeline task).

Glideins

HTCondor is able to send jobs to run on a remote compute site, even when that compute site is running a non-HTCondor system, by sending “pilot jobs”, or glideins, to remote batch systems.

Nodes for HTCondor’s glideins can be allocated with the help of ctrl_execute. Once you have allocated the nodes, you can specify the site where they are available in your BPS configuration file. For example:

site:
  acsws02:
    profile:
      condor:
        requirements: '(ALLOCATED_NODE_SET == "${NODESET}")'
        +JOB_NODE_SET: '"${NODESET}"'

Submitting a run

See bps submit.
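A typical invocation looks like the following, where bps_config.yaml is a placeholder name for your BPS configuration file:

bps submit bps_config.yaml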

Checking status

See bps status.

The plugin can take either the HTCondor ID (as shown in bps report or condor_q) or the submit path.

For workflows that have not completed, the speed of using the ID can depend on whether the command is run on the same submit machine (i.e., the local schedd) and on how busy the schedd machines are. For completed workflows, using the ID may not work if the HTCondor logs have rolled over between the time of completion and the time of the status command.
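For example, assuming bps status accepts the same --id option as the other bps subcommands (e.g., bps report), either form below should work; the path shown is a placeholder:

bps status --id 1234.0
bps status --id /path/to/submit/u/jdoe/pipelines_check/20240924T201447Z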

Printing a report

See bps report.

To make the summary report (bps report) faster, the plugin uses summary information available with the DAGMan job. For a running DAG, this status can lag behind by a few minutes. Also, DAGMan tracks deletion of individual jobs as failures (there are no separate counts for deleted jobs), so the summary report’s flag column will show F when there are either failed or deleted jobs. For a detailed report (bps report --id <ID>), the plugin reads detailed job information from files, so the detailed report can distinguish between failed and deleted jobs and will show D in the flag column for a running workflow if there is a deleted job.

Rarely, a detailed report may warn about job submission issues. For example:

Warn: Job submission issues (last: 01/30/25 10:36:57)

A job submission issue may or may not be intermittent, and it may cause problems with the status or counts in the reports. To get more information about the submission issue, look in the *.dag.dagman.out file for errors, in particular lines containing submit attempt failed.
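For example, something like the following (with <submit dir> replaced by your run’s submit directory) will surface those lines:

grep "submit attempt failed" <submit dir>/*.dag.dagman.out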

Occasionally, some jobs are put on hold by HTCondor. To see the reason why jobs are being held, use

condor_q -hold <ID>    # to see a specific job being held
condor_q -hold <user>  # to see all held jobs owned by the user

Canceling submitted jobs

See bps cancel.

If jobs are hanging around in the queue with an X status in the report displayed by bps report, you can add the following to force-delete those jobs from the queue:

--pass-thru "-forcex"
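For example, a full cancel command might look like the following (assuming the run is identified with --id, as with bps report):

bps cancel --id <ID> --pass-thru "-forcex"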

Restarting a failed run

See bps restart.

A valid run ID is one of the following:

  • job ID, e.g., 1234.0 (using just the cluster ID, 1234, will also work),

  • global job ID (e.g., sdfrome002.sdf.slac.stanford.edu#165725.0#1699393748),

  • run’s submit directory (e.g., /sdf/home/m/mxk/lsst/bps/submit/u/mxk/pipelines_check/20230713T135346Z).
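For example, a restart using the submit directory from the last bullet above might look like:

bps restart --id /sdf/home/m/mxk/lsst/bps/submit/u/mxk/pipelines_check/20230713T135346Z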

Note

If you don’t remember any of the run’s IDs, you may try running

bps report --username <username> --hist <n>

where <username> and <n> are, respectively, your user account and the number of past days you would like to include in your search. Keep in mind, though, that the availability of historical records depends on the HTCondor configuration and the load of the computational resource in use. Consequently, you may still get no results, in which case using the submit directory remains your only option.

When execution of a workflow is managed by HTCondor, BPS is able to instruct it to automatically retry jobs that failed by exceeding their memory allocation, with increased memory requirements (see the documentation of the memoryMultiplier option for more details). However, these increased memory requirements are not preserved between restarts. For example, if a job initially ran with 2 GB of memory and failed by exceeding that limit, HTCondor will retry it with 4 GB of memory. However, if the job, and as a result the entire workflow, fails again for other reasons, the job will ask for 2 GB of memory during the first execution after the workflow is restarted.
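As a sketch of the settings involved in the example above (the values are illustrative; see the ctrl_bps documentation for the authoritative descriptions):

requestMemory: 2048     # initial memory request in MB (2 GB)
memoryMultiplier: 2     # retry with double the memory after an out-of-memory failure
numberOfRetries: 3      # upper limit on automatic retries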

Provisioning resources automatically

Computational resources required to execute a workflow may not always be managed directly by HTCondor and may need to be provisioned first by a different workload manager, for example, Slurm. In such a case, ctrl_bps_htcondor can be instructed to run a provisioning job alongside the workflow, which will first create and then maintain the glideins necessary for executing the workflow.

This provisioning job is called provisioning_job.bash and is managed by HTCondor. Be careful not to remove it by accident when using the condor_rm or kill command. The job is run on a best-effort basis and will not be automatically restarted once deleted.

To enable automatic provisioning of the resources, add the following settings to your BPS configuration:

provisionResources: true
provisioning:
  provisioningMaxWallTime: <value>

where <value> is the approximate time your workflow needs to complete, e.g., 3600, 10:00:00.

This will instruct ctrl_bps_htcondor to include a service job that runs alongside the payload jobs in the workflow and automatically creates and maintains the glideins required for the payload jobs to run.

If you enable automatic provisioning of resources, you will see the status of the provisioning job in the output of the bps report --id <ID> command. Look for the line starting with “Provisioning job status”. For example:

 X   STATE   %S   ID  OPERATOR PROJECT CAMPAIGN PAYLOAD                  RUN
--- ------- --- ----- -------- ------- -------- ------- ---------------------------------------
    RUNNING   0   1.0     jdoe     dev    quick  pcheck u_jdoe_pipelines_check_20240924T201447Z


Path: /home/jdoe/submit/u/jdoe/pipelines_check/20240924T201447Z
Global job id: node001#1.0#1727208891
Provisioning job status: RUNNING


                  UNKNOWN MISFIT UNREADY READY PENDING RUNNING DELETED HELD SUCCEEDED FAILED PRUNED EXPECTED
----------------- ------- ------ ------- ----- ------- ------- ------- ---- --------- ------ ------ --------
TOTAL                   0      0       4     0       1       0       0    0         0      0      0        5
----------------- ------- ------ ------- ----- ------- ------- ------- ---- --------- ------ ------ --------
pipetaskInit            0      0       0     0       1       0       0    0         0      0      0        1
isr                     0      0       1     0       0       0       0    0         0      0      0        1
characterizeImage       0      0       1     0       0       0       0    0         0      0      0        1
calibrate               0      0       1     0       0       0       0    0         0      0      0        1
finalJob                0      0       1     0       0       0       0    0         0      0      0        1

If the provisioning job status is UNREADY, check the end of the report to see if there is a warning about submission issues. There may be a temporary problem. Check the *.dag.dagman.out file in the run’s submit directory for errors, in particular for ERROR: submit attempt failed.

If the provisioning job status is HELD, the hold reason will appear in parentheses.

The service job managing the glideins will be automatically canceled once the workflow is completed. However, the existing glideins will be left for HTCondor to shut down once they have been inactive for the period specified by provisioningMaxIdleTime (default value: 15 min., see below) or the maximum wall time is reached.

The provisioning job is expected to run for as long as the workflow does. If the job dies, its status will be FAILED. If it completes successfully before the workflow finishes, its status will be SUCCEEDED with a message saying it ended early (which may or may not cause a problem, since existing glideins could remain running). To get more information about either of these cases, check the job output and error files in the jobs/provisioningJob subdirectory.

If automatic provisioning of resources is enabled, the script that the service job runs in order to provide the required resources is defined by the provisioningScript setting in the provisioning section of your BPS configuration file. By default, ctrl_bps_htcondor will use allocateNodes.py from the ctrl_execute package with the following settings:

provisioning:
  provisioningNodeCount: 10
  provisioningMaxIdleTime: 900
  provisioningCheckInterval: 600
  provisioningQueue: "milano"
  provisioningAccountingUser: "rubin:developers"
  provisioningExtraOptions: ""
  provisioningPlatform: "s3df"
  provisioningScript: |
    #!/bin/bash
    set -e
    set -x
    while true; do
        ${CTRL_EXECUTE_DIR}/bin/allocateNodes.py \
            --account {provisioningAccountingUser} \
            --auto \
            --node-count {provisioningNodeCount} \
            --maximum-wall-clock {provisioningMaxWallTime} \
            --glidein-shutdown {provisioningMaxIdleTime} \
            --queue {provisioningQueue} \
            {provisioningExtraOptions} \
            {provisioningPlatform}
        sleep {provisioningCheckInterval}
    done
    exit 0

allocateNodes.py requires a small configuration file in the user’s directory in order to work. With automatic provisioning enabled, ctrl_bps_htcondor will create this file if it does not exist at the location defined by provisioningScriptConfigPath, using the template defined by the provisioningScriptConfig setting in the provisioning section:

provisioning:
  provisioningScriptConfig: |
    config.platform["{provisioningPlatform}"].user.name="${USER}"
    config.platform["{provisioningPlatform}"].user.home="${HOME}"
  provisioningScriptConfigPath: "${HOME}/.lsst/condor-info.py"

If you’re using a custom provisioning script that does not require any external configuration, set provisioningScriptConfig to an empty string.

If the file already exists, it will be used as is (BPS will not update it with config settings). If you wish BPS to overwrite the file with the provisioningScriptConfig values, you need to manually remove or rename the existing file.

Note

${CTRL_BPS_HTCONDOR_DIR}/python/lsst/ctrl/bps/htcondor/etc/htcondor_defaults.yaml contains default values that are used by every bps submission when using the ctrl_bps_htcondor plugin and are automatically included in your submission configuration.

Releasing held jobs

Occasionally, when HTCondor encounters issues during a job’s execution, it places the job in the hold state. You can see which of your submitted jobs are currently held, and why, by using the command:

condor_q -held

If any of your jobs are being held, it will display something similar to:

-- Schedd: sdfrome002.sdf.slac.stanford.edu : <172.24.33.226:21305?... @ 10/02/24 10:59:41
ID           OWNER  HELD_SINCE  HOLD_REASON
5485584.0    jdoe   9/23 11:04  Error from slot_jdoe_8693_1_1@sdfrome051.sdf.slac.stanford.edu: Failed to execute '/sdf/group/rubin/sw/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_mpexec/g1ce94f1343+74d41caebd/bin/pipetask' with arguments --long-log --log-level=VERBOSE run-qbb /repo/ops-rehearsal-3-prep /sdf/home/j/jdoe/u/pipelines/submit/u/jdoe/DM-43059/step3/20240301T190055Z/u_jdoe_step3_20240301T190055Z.qgraph --qgraph-node-id 6b5daf05-10fc-462e-82e0-cc618be83a12: (errno=2: 'No such file or directory')
5471792.0    jdoe   7/10 08:27  File '/sdf/group/rubin/sw/conda/envs/lsst-scipipe-8.0.0/bin/condor_dagman' is missing or not executable
7636239.0    jdoe   3/20 01:32  Job raised a signal 11. Handling signal as if job has gone over memory limit.
5497548.0    jdoe   3/6  00:14  Job raised a signal 9. Handling signal as if job has gone over memory limit.
12863358.0   jdoe   6/27 11:05  Error from slot_jdoe_32400_1_1@sdfrome009.sdf.slac.stanford.edu: Failed to open '/sdf/data/rubin/shared/jdoe/simulation/output/output.0' as standard output: No such file or directory (errno 2)
20590593.0   jdoe   6/23 13:03  Transfer output files failure at the execution point while sending files to access point sdfrome001. Details: reading from file /lscratch/jdoe/execute/dir_1460253/_condor_stdout: (errno 2) No such file or directory
12033406.0   jdoe   5/13 10:48  Cannot access initial working directory /sdf/data/rubin/user/jdoe/repo-main-logs/submit/u/jdoe/20240311T231829Z: No such file or directory

Note

If you would like to display held jobs that were submitted by another user, use condor_q -held <username> instead, where <username> is the user account whose held jobs you would like to check. See the condor_q man page for other supported options.

A job in the hold state can be released with condor_release, provided the issue that made HTCondor put it in this state has been resolved. For example, if your job with ID 1234.0 was placed in the hold state because during execution it exceeded the 2048 MiB you requested for it at submission, you can double the amount of memory it should request with

condor_qedit 1234.0 RequestMemory=4096

and then release it from the hold state with

condor_release 1234.0

When the job is released from the hold state, HTCondor puts it into the IDLE state and will rerun it using the exact same command and environment as before.

Note

Placing jobs in the hold state due to missing files or directories usually happens when the glideins expire or there are filesystem issues. After creating new glideins with allocateNodes.py (see Provisioning resources automatically for future submissions) or after the filesystem issues have been resolved, it should typically be safe to release the jobs from the hold state.

If HTCondor placed multiple jobs in the hold state and you only want to deal with a subset of the currently held jobs, use the -constraint <expression> option that both condor_qedit and condor_release support, where <expression> can be an arbitrarily complex HTCondor ClassAd expression. For example:

condor_qedit -constraint "JobStatus == 5 && HoldReasonCode == 3 && HoldReasonSubCode == 34" RequestMemory=4096
condor_release -constraint "JobStatus == 5 && HoldReasonCode == 3 && HoldReasonSubCode == 34"

will only affect jobs that were placed in the hold state (JobStatus is 5) for a specific reason, here, the memory usage exceeded memory limits (HoldReasonCode is 3 and HoldReasonSubCode is 34).

Note

By default, BPS will automatically retry jobs that failed due to an out-of-memory error (see the Automatic memory scaling section in the ctrl_bps documentation for more information on this topic), so the issues illustrated by the above examples should only occur if automatic memory scaling was explicitly disabled in the submit YAML file.

Automatic Releasing of Held Jobs

Often, simply releasing the jobs to try again is successful because the underlying system issues are transient.

releaseExpr can be set in the submit YAML to add automatic release conditions (a configuration sketch follows the example expressions below). Like other BPS config values, it can be set globally or for a specific cluster or pipetask. The number of retries is still limited by numberOfRetries, and all held jobs count towards this limit regardless of the reason. The plugin prohibits the automatic release of jobs held by a user.

Example expressions:

  • releaseExpr: "True" - will always release held job unless held by user.

  • releaseExpr: "HoldReasonCode =?= 7" - release jobs where the standard output file for the job could not be opened.

For more information about expressions, see the HTCondor documentation.
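As a sketch of where such expressions go in the submit YAML (the calibrate label is taken from the report example above; adjust names, expressions, and limits to your workflow):

releaseExpr: "True"          # globally release any held job (unless held by a user)
numberOfRetries: 5

pipetask:
  calibrate:
    # Narrower condition for this pipetask only: release jobs whose standard
    # output file could not be opened (HoldReasonCode 7).
    releaseExpr: "HoldReasonCode =?= 7"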

Warning

System problems should still be tracked and reported. All of the hold reasons for a single completed run can be found via grep -A 2 held <submit dir>/*.nodes.log.

Troubleshooting

Where is stdout/stderr from pipeline tasks?

For now, stdout/stderr can be found in files in the run’s submit directory after the job is done. Python logging goes to stderr, so the majority of the pipetask output will be in the *.err file. One exception is finalJob, which does print some information to stdout (the *.out file).

While the job is running, the owner of the job can use the condor_tail command to peek at the stdout/stderr of a job. bps uses the ID for the entire workflow, but for the HTCondor command condor_tail you will need the ID of the individual job. Run the following command and look for the ID of the job (undefined values are normal and usually correspond to the DAGMan jobs).

condor_q -run -nobatch -af:hj bps_job_name bps_run

Once you have the HTCondor ID for the particular job whose output you want to peek at, run this command:

condor_tail -stderr -f <ID>

If you instead want to see stdout, leave off -stderr. If you need to see more of the contents, specify -maxbytes <numbytes> (the default is 1024 bytes).
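For example, to follow a job’s stderr while allowing more output than the default:

condor_tail -stderr -maxbytes 1000000 -f <ID>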

I need to look around on the compute node where my job is running.

If using glideins, you might be able to just ssh to the compute node from the submit node. First, you need to find out which node the job is running on.

condor_q -run -nobatch -af:hj RemoteHost bps_job_name bps_run

Alternatively, HTCondor has the command condor_ssh_to_job where you just need the job ID. This is not the workflow ID (the ID that bps commands use), but an individual job ID. The command above also prints the job IDs.

Why did my submission fail?

Check the *.dag.dagman.out file in the run’s submit directory for errors, in particular for ERROR: submit attempt failed.

I enabled automatic provisioning, but my jobs still sit idle in the queue!

The service node responsible for executing the provisioning script runs on a best-effort basis. If it fails to submit correctly or crashes during the workflow execution, this will not register as an error and the workflow will continue normally until the existing glideins expire. As a result, payload jobs may get stuck in the job queue if the glideins were not created or expired before the execution of the workflow could be completed.

First, use bps report --id <run ID> to display the run report and look for the line

Provisioning job status: <status>

If the <status> is anything other than RUNNING, automatic provisioning is not working. In such a case, create glideins manually to complete your run.
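For example, a manual invocation mirroring the defaults of the provisioning script shown earlier (the account, queue, platform, and wall-clock values are site-specific assumptions; adjust them for your site):

${CTRL_EXECUTE_DIR}/bin/allocateNodes.py \
    --account rubin:developers \
    --auto \
    --node-count 10 \
    --maximum-wall-clock 4:00:00 \
    --glidein-shutdown 900 \
    --queue milano \
    s3df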