.. _htc-plugin-overview:

Overview
--------

LSST Batch Processing Service (BPS) allows large-scale workflows to execute in
a well-managed fashion, potentially in multiple environments. The service is
provided by the `ctrl_bps`_ package. ``ctrl_bps_htcondor`` is a plugin
allowing `ctrl_bps`_ to execute workflows on computational resources managed
by `HTCondor`_.

.. _htc-plugin-preqs:

Prerequisites
-------------

#. `ctrl_bps`_, the package providing BPS.

#. An `HTCondor`_ cluster.

#. HTCondor's Python `bindings`__.

.. __: https://htcondor.readthedocs.io/en/latest/apis/python-bindings/index.html

.. _htc-plugin-installing:

Installing the plugin
---------------------

Starting from LSST Stack version ``w_2022_18``, the HTCondor plugin package for
Batch Processing Service, ``ctrl_bps_htcondor``, comes with ``lsst_distrib``.
However, if you'd like to try out its latest features, you may install
a bleeding edge version similarly to any other LSST package:

.. code-block:: bash

   git clone https://github.com/lsst/ctrl_bps_htcondor
   cd ctrl_bps_htcondor
   setup -k -r .
   scons

.. _htc-plugin-wmsclass:

Specifying the plugin
---------------------

The class providing `HTCondor`_ support for `ctrl_bps`_ is

::

   lsst.ctrl.bps.htcondor.HTCondorService

Inform `ctrl_bps`_ about its location using one of the methods described in its
`documentation`__.

.. __: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/index.html

.. _htc-plugin-defining-submission:

Defining a submission
---------------------

BPS configuration files are YAML files with some reserved keywords and some
special features. See `BPS configuration file`__ for details.

The plugin supports all settings described in `ctrl_bps documentation`__
*except* **preemptible**.

.. Describe any plugin specific aspects of defining a submission below if any.

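
As a minimal sketch, a BPS configuration file that selects this plugin might
contain the fragment below. The ``wmsServiceClass`` key name comes from the
``ctrl_bps`` documentation rather than from this page, so treat it as an
assumption and check it against the version you are running.

.. code-block:: YAML

   # Assumed key name (defined by ctrl_bps, not by this plugin).
   wmsServiceClass: lsst.ctrl.bps.htcondor.HTCondorService

All other submission settings stay as described in the ``ctrl_bps``
documentation.
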

`HTCondor`_ is able to send jobs to run on a remote compute site, even when
that compute site is running a non-HTCondor system, by sending "pilot jobs", or
**glideins**, to remote batch systems.

Nodes for HTCondor's glideins can be allocated with the help of
`ctrl_execute`_. Once you have allocated the nodes, you can specify the site
where they are available in your BPS configuration file. For example:

.. code-block:: YAML

   site:
     acsws02:
       profile:
         condor:
           requirements: '(ALLOCATED_NODE_SET == "${NODESET}")'
           +JOB_NODE_SET: '"${NODESET}"'

.. note::

   The `ctrl_execute`_ package is not part of the `lsst_distrib`_
   metapackage, so it and its dependencies must be installed manually.

.. __: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#bps-configuration-file
.. __: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#supported-settings

.. .. _htc-plugin-authenticating:

.. Authenticating
.. --------------

.. Describe any plugin specific aspects of an authentication below if any.

.. _htc-plugin-submit:

Submitting a run
----------------

See `bps submit`_.

.. Describe any plugin specific aspects of a submission below if any.

.. _htc-plugin-report:

Checking status
---------------

See `bps report`_.

.. Describe any plugin specific aspects of checking a submission status below if any.

Occasionally, some jobs are put on hold by HTCondor. To see the reason why
jobs are being held, use

.. code-block:: bash

   condor_q -hold <id>    # to see a specific job being held
   condor_q -hold <user>  # to see all held jobs owned by the user

.. _htc-plugin-cancel:

Canceling submitted jobs
------------------------

See `bps cancel`_.

.. Describe any plugin specific aspects of canceling submitted jobs below if any.

If jobs are hanging around in the queue with an ``X`` status in the report
displayed by ``bps report``, you can add the following option to ``bps cancel``
to force-delete those jobs from the queue

::

   --pass-thru "-forcex"

.. _htc-plugin-restart:

Restarting a failed run
-----------------------

See `bps restart`_.

.. Describe any plugin specific aspects of restarting failed jobs below if any.

When execution of a workflow is managed by `HTCondor`_, BPS is able to
instruct it to automatically retry jobs that failed because they exceeded
their memory allocation, this time with increased memory requirements (see
the documentation of the ``memoryMultiplier`` option for more details).
However, these increased memory requirements are not preserved between
restarts. For example, if a job initially ran with 2 GB of memory and failed
because it exceeded the limit, `HTCondor`_ will retry it with 4 GB of memory.
If the job, and as a result the entire workflow, then fails again for other
reasons, the job will ask for 2 GB of memory during the first execution after
the workflow is restarted.

.. _htc-plugin-troubleshooting:

Troubleshooting
---------------

Where is stdout/stderr from pipeline tasks?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For now, stdout/stderr can be found in files in the run submit directory.

Why did my submission fail?
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Check the ``*.dag.dagman.out`` file in the run submit directory for errors, in
particular for ``ERROR: submit attempt failed``.

.. _HTCondor: https://htcondor.readthedocs.io/en/latest/
.. _bps cancel: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#canceling-submitted-jobs
.. _bps report: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#checking-status
.. _bps restart: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#restarting-a-failed-run
.. _bps submit: https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#submitting-a-run
.. _ctrl_bps: https://github.com/lsst/ctrl_bps
.. _ctrl_execute: https://github.com/lsst/ctrl_execute
.. _condor_q: https://htcondor.readthedocs.io/en/latest/man-pages/condor_q.html
.. _condor_rm: https://htcondor.readthedocs.io/en/latest/man-pages/condor_rm.html
.. _lsst_distrib: https://github.com/lsst/lsst_distrib.git