Tiger

class lsst.ctrl.bps.parsl.sites.princeton.Tiger(*args, **kwargs)

Bases: Slurm

Configuration for running jobs on Princeton’s Tiger cluster.

The following BPS configuration parameters are recognised, overriding the defaults:

  • nodes (int): number of nodes for each Slurm job.

  • cores_per_node (int): number of cores per node for each Slurm job.

  • walltime (str): time limit for each Slurm job.

  • mem_per_node (int): memory per node (GB) for each Slurm job.

  • max_blocks (int): maximum number of blocks (Slurm jobs) to use.

  • cmd_timeout (int): timeout (seconds) for commands issued to the scheduler.

  • singleton (bool): allow only one job to run at a time; by default True.

When running on the Tiger cluster, you should operate on the /scratch/gpfs filesystem, rather than /projects or /tigress; the latter are not even mounted on the cluster nodes any more.
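As an illustration, a BPS submission config that selects this site class and overrides some of the parameters above might look like the following sketch. The site label `tiger` and all of the parameter values are arbitrary examples, not recommended settings:

```yaml
# Sketch of a BPS submission config using the Tiger site class.
# The site label "tiger" and the values below are illustrative only.
wmsServiceClass: lsst.ctrl.bps.parsl.ParslService
computeSite: tiger

site:
  tiger:
    class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
    nodes: 2
    cores_per_node: 40
    walltime: "4:00:00"
    mem_per_node: 180
    max_blocks: 2
    cmd_timeout: 300
    singleton: true
```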

Methods Summary

  • get_address(): Return the IP address of the machine hosting the driver/submission.

  • get_executors(): Get a list of executors to be used in processing.

  • select_executor(job): Get the label of the executor to use to execute a job.

Methods Documentation

get_address() → str

Return the IP address of the machine hosting the driver/submission.

This host machine address should be accessible from the workers and should generally be the return value of one of the functions in parsl.addresses.

This is used by the default implementation of get_monitor, but will generally be used by get_executors too.

This implementation gets the address from the Infiniband network interface, because the cluster nodes can’t connect to the head node through the regular internet.
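The interface-based lookup can be sketched with the standard library alone; this mirrors the approach of parsl.addresses.address_by_interface (a Linux-only ioctl). The interface name "ib0" used in the usage note is an assumption, not necessarily the interface name on Tiger:

```python
import fcntl
import socket
import struct


def address_by_interface(ifname: str) -> str:
    """Return the IPv4 address bound to a network interface (Linux-only).

    Mirrors the approach taken by parsl.addresses.address_by_interface.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # Pack the interface name into an ifreq structure
        packed = struct.pack("256s", ifname[:15].encode())
        # SIOCGIFADDR (0x8915): ask the kernel for the interface's address
        info = fcntl.ioctl(s.fileno(), 0x8915, packed)
    # The sockaddr's IPv4 address sits at bytes 20-24 of the ifreq
    return socket.inet_ntoa(info[20:24])
```

On the cluster one would pass the Infiniband interface (e.g. "ib0", an assumption here) rather than the loopback device.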

get_executors() → list[ParslExecutor]

Get a list of executors to be used in processing.

Each executor should have a unique label.

The walltime default here is set so we get into the tiger-vshort QoS, which will hopefully reduce the wait for us to get a node. Then, we have one Slurm job running at a time (singleton) while another saves a spot in line (max_blocks=2). We hope that this will allow us to run almost continually until the workflow is done.

We set the cmd_timeout value to 300 seconds to help avoid TimeoutExpired errors when commands are slow to return (often due to system contention).
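The override behaviour described at the top of this page can be sketched as a simple overlay of user-supplied BPS parameters on the site defaults. The default values below are hypothetical placeholders, not the class's actual defaults:

```python
# Illustrative sketch of how site defaults combine with BPS overrides.
# These default values are hypothetical, not the class's actual defaults.
TIGER_DEFAULTS = {
    "nodes": 1,
    "cores_per_node": 40,
    "walltime": "0:59:00",  # short enough for a "vshort"-style QoS (assumption)
    "mem_per_node": 180,
    "max_blocks": 2,  # one running job plus one holding a place in the queue
    "cmd_timeout": 300,
    "singleton": True,
}


def resolve_site_config(overrides: dict) -> dict:
    """Overlay recognised BPS parameters on the site defaults."""
    unknown = set(overrides) - set(TIGER_DEFAULTS)
    if unknown:
        raise KeyError(f"unrecognised site parameters: {sorted(unknown)}")
    return {**TIGER_DEFAULTS, **overrides}
```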

select_executor(job: ParslJob) → str

Get the label of the executor to use to execute a job.

Parameters

job : ParslJob
    Job to be executed.

Returns

label : str
    Label of executor to use to execute job.