QuantumBackedButler¶

class lsst.daf.butler.QuantumBackedButler(predicted_inputs: Iterable[UUID], predicted_outputs: Iterable[UUID], dimensions: DimensionUniverse, datastore: Datastore, storageClasses: StorageClassFactory, dataset_types: Mapping[str, DatasetType] | None = None, metrics: ButlerMetrics | None = None)¶

Bases: LimitedButler

An implementation of LimitedButler intended to back execution of a single Quantum.

Parameters:

predicted_inputsIterable [DatasetId]: Dataset IDs for datasets that can be read from this butler.
predicted_outputsIterable [DatasetId]: Dataset IDs for datasets that can be stored in this butler.
dimensionsDimensionUniverse: Object managing all dimension definitions.
datastoreDatastore: Datastore to use for all dataset I/O and existence checks.
storageClassesStorageClassFactory: Object managing all storage class definitions.
dataset_typesMapping [str, DatasetType]: The registry dataset type definitions, indexed by name.
metricslsst.daf.butler.ButlerMetrics or None, optional: Metrics object for tracking butler statistics.

Notes

Most callers should use the initialize classmethod to construct new instances instead of calling the constructor directly.

QuantumBackedButler uses a SQLite database internally, in order to reuse existing DatastoreRegistryBridge and OpaqueTableStorage implementations that rely on SQLAlchemy. If implementations are added in the future that don’t rely on SQLAlchemy, it should be possible to swap them in by overriding the type arguments to initialize (though at present, QuantumBackedButler would still create at least an in-memory SQLite database that would then go unused).

We imagine QuantumBackedButler being used during (at least) batch execution to capture Datastore records and save them to per-quantum files, which are also a convenient place to store provenance for eventual upload to a SQL-backed Registry (once Registry has tables to store provenance, that is). These per-quantum files can be written in two ways:

The SQLite file used internally by QuantumBackedButler can be used directly by customizing the filename argument to initialize, and then transferring that file to the object store after execution completes (or fails; a try/finally pattern probably makes sense here).
A JSON or YAML file can be written by calling extract_provenance_data, and using pydantic methods to write the returned QuantumProvenanceData to a file.

Note that at present, the SQLite file only contains datastore records, not provenance, but that should be easy to address (if desired) after we actually design a Registry schema for provenance. I also suspect that we’ll want to explicitly close the SQLite file somehow before trying to transfer it. But I’m guessing we’d prefer to write the per-quantum files as JSON anyway.
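The try/finally pattern suggested above for the SQLite route could look roughly like the following. This is a minimal sketch, not part of the library: `run_quantum_with_sqlite_capture` and its three callables are hypothetical names, and `make_butler` is assumed to wrap a call to `QuantumBackedButler.initialize(..., filename=...)`.

```python
from pathlib import Path
from typing import Any, Callable


def run_quantum_with_sqlite_capture(
    make_butler: Callable[[str], Any],
    execute: Callable[[Any], None],
    upload: Callable[[Path], None],
    db_path: Path,
) -> None:
    """Run one quantum and transfer the backing SQLite file afterwards.

    All three callables are hypothetical hooks supplied by the caller:
    ``make_butler(filename)`` is assumed to wrap
    ``QuantumBackedButler.initialize(..., filename=filename)``,
    ``execute(butler)`` runs the task, and ``upload(path)`` copies the
    file to the object store.
    """
    butler = make_butler(str(db_path))
    try:
        execute(butler)
    finally:
        # Runs on success and on failure, so records captured before a
        # failure are still transferred.
        # Assumption: nothing still holds the SQLite file open here; the
        # notes above leave the question of closing it unresolved.
        if db_path.exists():
            upload(db_path)
```

Any exception raised by `execute` still propagates to the caller after the upload has been attempted.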

Attributes Summary

dimensions

Structure managing all dimensions recognized by this data repository (DimensionUniverse).

Methods Summary

`export_predicted_datastore_records`(refs)	Export datastore records for a set of predicted output dataset references.
`extract_provenance_data`()	Extract provenance information and datastore records from this butler.
`from_predicted`(config, predicted_inputs, ...)	Construct a new `QuantumBackedButler` from sets of input and output dataset IDs.
`get`(ref, /, *[, parameters, storageClass])	Retrieve a stored dataset.
`getDeferred`(ref, /, *[, parameters, ...])	Create a `DeferredDatasetHandle` which can later retrieve a dataset, after an immediate registry lookup.
`initialize`(config, quantum, dimensions[, ...])	Construct a new `QuantumBackedButler` from repository configuration and helper types.
`isWriteable`()	Return `True` if this `Butler` supports write operations.
`markInputUnused`(ref)	Indicate that a predicted input was not actually used when processing a `Quantum`.
`pruneDatasets`(refs, *[, disassociate, ...])	Remove one or more datasets from a collection and/or storage.
`put`(obj, ref, /, *[, provenance])	Store a dataset that already has a UUID and `RUN` collection.
`retrieve_artifacts`(refs, destination[, ...])	Retrieve the artifacts associated with the supplied refs.
`retrieve_artifacts_zip`(refs, destination[, ...])	Retrieve artifacts from the graph and place in ZIP file.
`stored`(ref)	Indicate whether the dataset's artifacts are present in the Datastore.
`stored_many`(refs)	Check the datastore for artifact existence of multiple datasets at once.

Attributes Documentation

dimensions¶

Methods Documentation

export_predicted_datastore_records(refs: Iterable[DatasetRef]) → dict[str, lsst.daf.butler.datastore.record_data.DatastoreRecordData]¶

Export datastore records for a set of predicted output dataset references.

Parameters:

refsIterable [ DatasetRef ]: Dataset references for which to export datastore records. These refs must be known to this butler.

Returns:

recordsdict [ str, DatastoreRecordData ]: Predicted datastore records indexed by datastore name. No attempt is made to ensure that the associated datasets exist on disk.

extract_provenance_data() → QuantumProvenanceData¶

Extract provenance information and datastore records from this butler.

Returns:

provenanceQuantumProvenanceData: A serializable struct containing input/output dataset IDs and datastore records. This assumes all dataset IDs are UUIDs (just to make it easier for pydantic to reason about the struct’s types); the rest of this class makes no such assumption, but the approach to processing in which it’s useful effectively requires UUIDs anyway.
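Writing the returned struct to a JSON file might be sketched as follows. The helper name `write_provenance_json` is hypothetical, and the sketch assumes QuantumProvenanceData is a pydantic v2 model that provides `model_dump_json()`; with pydantic v1 the equivalent call would be `.json()`.

```python
from pathlib import Path
from typing import Any


def write_provenance_json(butler: Any, path: Path) -> Path:
    """Serialize the butler's provenance struct to a JSON file.

    ``butler`` is expected to be a QuantumBackedButler; it is typed as
    ``Any`` so the sketch has no import-time dependency on lsst.daf.butler.
    """
    provenance = butler.extract_provenance_data()
    # Assumption: the returned struct is a pydantic v2 model.
    path.write_text(provenance.model_dump_json(), encoding="utf-8")
    return path
```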

Notes

QuantumBackedButler records this provenance information when its methods are used, which mostly saves PipelineTask authors from having to worry about it while still recording very detailed information. But it has two small weaknesses:

Calling getDeferred or get is enough to mark a dataset as an “actual input”, which may mark some datasets that aren’t actually used. We rely on task authors to use markInputUnused to address this.
We assume that the execution system will call stored on all predicted inputs prior to execution, in order to populate the “available inputs” set. This is what I envision SingleQuantumExecutor doing after we update it to use this class, but it feels fragile for this class to make such a strong assumption about how it will be used, even if I can’t think of any other executor behavior that would make sense.
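The second assumption, that an executor calls stored on every predicted input before running the task, could be met with a loop like the one below. This is only a sketch; `check_predicted_inputs` is a hypothetical helper name, and how an executor reacts to missing inputs is left to the caller.

```python
from typing import Any, Iterable


def check_predicted_inputs(butler: Any, input_refs: Iterable[Any]) -> list:
    """Call ``stored`` on every predicted input and return the missing refs.

    ``butler`` is expected to be a QuantumBackedButler and ``input_refs``
    the resolved DatasetRef objects predicted as inputs of the quantum.
    """
    missing = []
    for ref in input_refs:
        # Each call gives the butler the chance to record whether this
        # input is available, which is what the provenance relies on.
        if not butler.stored(ref):
            missing.append(ref)
    return missing
```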

classmethod from_predicted(config: Config | str | ParseResult | ResourcePath | Path, predicted_inputs: Iterable[UUID], predicted_outputs: Iterable[UUID], dimensions: DimensionUniverse, datastore_records: Mapping[str, DatastoreRecordData], filename: str | None = None, OpaqueManagerClass: type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] | None = None, BridgeManagerClass: type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] | None = None, search_paths: list[str] | None = None, dataset_types: Mapping[str, DatasetType] | None = None, metrics: ButlerMetrics | None = None) → QuantumBackedButler¶

Construct a new QuantumBackedButler from sets of input and output dataset IDs.

Parameters:

configConfig or ResourcePathExpression: A butler repository root, configuration filename, or configuration instance.
predicted_inputsIterable [DatasetId]: Dataset IDs for datasets that can be read from this butler.
predicted_outputsIterable [DatasetId]: Dataset IDs for datasets that can be stored in this butler; these must be fully resolved.
dimensionsDimensionUniverse: Object managing all dimension definitions.
datastore_recordsdict [str, DatastoreRecordData] or None: Datastore records to import into a datastore.
filenamestr, optional: Name for the SQLite database that will back this butler; defaults to an in-memory database.
OpaqueManagerClasstype, optional: A subclass of OpaqueTableStorageManager to use for datastore opaque records. Default is a SQL-backed implementation.
BridgeManagerClasstype, optional: A subclass of DatastoreRegistryBridgeManager to use for datastore location records. Default is a SQL-backed implementation.
search_pathslist of str, optional: Additional search paths for butler configuration.
dataset_typesMapping [str, DatasetType], optional: Mapping of the dataset type name to its registry definition.
metricslsst.daf.butler.ButlerMetrics or None, optional: Metrics object for gathering butler statistics.

get(ref: DatasetRef, /, *, parameters: dict[str, Any] | None = None, storageClass: StorageClass | str | None = None) → Any¶

Retrieve a stored dataset.

Parameters:

refDatasetRef: A resolved DatasetRef directly associated with a dataset.
parametersdict: Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.
storageClassStorageClass or str, optional: The storage class to be used to override the Python type returned by this method. By default the returned type matches the dataset type definition for this dataset. Specifying a read StorageClass can force a different type to be returned. This type must be compatible with the original type.

Returns:

objobject: The dataset.

Notes

In a LimitedButler the only allowable way to specify a dataset is to use a resolved DatasetRef. Subclasses can support more options.

getDeferred(ref: DatasetRef, /, *, parameters: dict[str, Any] | None = None, storageClass: str | StorageClass | None = None) → DeferredDatasetHandle¶

Create a DeferredDatasetHandle which can later retrieve a dataset, after an immediate registry lookup.

Parameters:

refDatasetRef: For the default implementation of a LimitedButler, the only acceptable parameter is a resolved DatasetRef.
parametersdict: Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.
storageClassStorageClass or str, optional: The storage class to be used to override the Python type returned by this method. By default the returned type matches the dataset type definition for this dataset. Specifying a read StorageClass can force a different type to be returned. This type must be compatible with the original type.

Returns:

objDeferredDatasetHandle: A handle which can be used to retrieve a dataset at a later time.

Notes

In a LimitedButler the only allowable way to specify a dataset is to use a resolved DatasetRef. Subclasses can support more options.
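One use of deferred handles is to obtain them all up front and read each dataset only when it is needed, so that only one dataset is in memory at a time. The generator below is a sketch under that assumption; `iter_datasets` is a hypothetical helper, and it relies only on getDeferred and on the handle's `get()` method.

```python
from typing import Any, Iterable, Iterator


def iter_datasets(butler: Any, refs: Iterable[Any]) -> Iterator[Any]:
    """Yield datasets one at a time through deferred handles.

    ``butler`` is expected to be a LimitedButler such as
    QuantumBackedButler, and ``refs`` resolved DatasetRef objects.
    """
    # Handles are created immediately; no dataset is read at this point.
    handles = [butler.getDeferred(ref) for ref in refs]
    for handle in handles:
        # The actual read happens here, once per iteration step.
        yield handle.get()
```

Because it is a generator, nothing happens until the caller starts iterating.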

classmethod initialize(config: Config | str | ParseResult | ResourcePath | Path, quantum: Quantum, dimensions: DimensionUniverse, filename: str | None = None, OpaqueManagerClass: type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] | None = None, BridgeManagerClass: type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] | None = None, search_paths: list[str] | None = None, dataset_types: Mapping[str, DatasetType] | None = None, metrics: ButlerMetrics | None = None) → QuantumBackedButler¶

Construct a new QuantumBackedButler from repository configuration and helper types.

Parameters:

configConfig or ResourcePathExpression: A butler repository root, configuration filename, or configuration instance.
quantumQuantum: Object describing the predicted input and output datasets relevant to this butler. This must have resolved DatasetRef instances for all inputs and outputs.
dimensionsDimensionUniverse: Object managing all dimension definitions.
filenamestr, optional: Name for the SQLite database that will back this butler; defaults to an in-memory database.
OpaqueManagerClasstype, optional: A subclass of OpaqueTableStorageManager to use for datastore opaque records. Default is a SQL-backed implementation.
BridgeManagerClasstype, optional: A subclass of DatastoreRegistryBridgeManager to use for datastore location records. Default is a SQL-backed implementation.
search_pathslist of str, optional: Additional search paths for butler configuration.
dataset_typesMapping [str, DatasetType], optional: Mapping of the dataset type name to its registry definition.
metricslsst.daf.butler.ButlerMetrics or None, optional: Metrics object for gathering butler statistics.

isWriteable() → bool¶: Return True if this Butler supports write operations.

markInputUnused(ref: DatasetRef) → None¶

Indicate that a predicted input was not actually used when processing a Quantum.

Parameters:

refDatasetRef: Reference to the unused dataset.

Notes

By default, a dataset is considered “actually used” if it is accessed via get or a handle to it is obtained via getDeferred (even if the handle is not used). This method must be called after one of those in order to remove the dataset from the actual input list.

This method does nothing for butlers that do not store provenance information (which is the default implementation provided by the base class).
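The ordering requirement (read first, then mark as unused) can be illustrated with a small wrapper. This is a sketch only; `get_or_discard` and the `is_usable` predicate are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Optional


def get_or_discard(
    butler: Any, ref: Any, is_usable: Callable[[Any], bool]
) -> Optional[Any]:
    """Read a predicted input and un-mark it if the task rejects it.

    ``butler`` is expected to be a QuantumBackedButler and ``ref`` a
    resolved DatasetRef; ``is_usable`` is a caller-supplied check.
    """
    obj = butler.get(ref)  # this access marks the dataset as an actual input
    if is_usable(obj):
        return obj
    # Must come after the get() call, otherwise there is nothing to undo.
    butler.markInputUnused(ref)
    return None
```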

pruneDatasets(refs: Iterable[DatasetRef], *, disassociate: bool = True, unstore: bool = False, tags: Iterable[str] = (), purge: bool = False) → None¶

Remove one or more datasets from a collection and/or storage.

Parameters:

refsIterable of DatasetRef: Datasets to prune. These must be “resolved” references (not just a DatasetType and data ID).
disassociatebool, optional: Disassociate pruned datasets from tags, or from all collections if purge=True.
unstorebool, optional: If True (False is default) remove these datasets from all datastores known to this butler. Note that this will make it impossible to retrieve these datasets even via other collections. Datasets that are already not stored are ignored by this option.
tagsIterable [ str ], optional: TAGGED collections to disassociate the datasets from. Ignored if disassociate is False or purge is True.
purgebool, optional: If True (False is default), completely remove the dataset from the Registry. To prevent accidental deletions, purge may only be True if all of the following conditions are met: disassociate is True; unstore is True. This mode may remove provenance information from datasets other than those provided, and should be used with extreme care.

Raises:

TypeError: Raised if the butler is read-only, if no collection was provided, or the conditions for purge=True were not met.
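The documented preconditions for purge=True can be restated as a small check. This is an illustration of the rule as described here, not the library's own implementation; `validate_prune_arguments` is a hypothetical helper whose keyword defaults mirror the pruneDatasets signature.

```python
def validate_prune_arguments(
    *, disassociate: bool = True, unstore: bool = False, purge: bool = False
) -> None:
    """Raise TypeError if the documented purge preconditions are not met."""
    if purge and not (disassociate and unstore):
        raise TypeError(
            "purge=True requires both disassociate=True and unstore=True"
        )
```

With the defaults, passing only purge=True fails because unstore is False.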

put(obj: Any, ref: DatasetRef, /, *, provenance: DatasetProvenance | None = None) → DatasetRef¶

Store a dataset that already has a UUID and RUN collection.

Parameters:

objobject: The dataset.
refDatasetRef: Resolved reference for a not-yet-stored dataset.
provenanceDatasetProvenance or None, optional: Any provenance that should be attached to the serialized dataset. Not supported by all serialization mechanisms.

Returns:

refDatasetRef: The same ref as the one given, for convenience and symmetry with Butler.put.

Raises:

TypeError: Raised if the butler is read-only.

Notes

Whether this method inserts the given dataset into a Registry is implementation defined (some LimitedButler subclasses do not have a Registry), but it always adds the dataset to a Datastore, and the given ref.id and ref.run are always preserved.

retrieve_artifacts(refs: Iterable[DatasetRef], destination: str | ParseResult | ResourcePath | Path, transfer: str = 'auto', preserve_path: bool = True, overwrite: bool = False) → list[lsst.resources._resourcePath.ResourcePath]¶

Retrieve the artifacts associated with the supplied refs.

Parameters:

refsiterable of DatasetRef: The datasets for which artifacts are to be retrieved. A single ref can result in multiple artifacts. The refs must be resolved.
destinationlsst.resources.ResourcePath or str: Location to write the artifacts.
transferstr, optional: Method to use to transfer the artifacts. Must be one of the options supported by transfer_from(). “move” is not allowed.
preserve_pathbool, optional: If True the full path of the artifact within the datastore is preserved. If False the final file component of the path is used.
overwritebool, optional: If True allow transfers to overwrite existing files at the destination.

Returns:

targetslist of lsst.resources.ResourcePath: URIs of file artifacts in destination location. Order is not preserved.
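The effect of preserve_path on where an artifact lands can be pictured with plain path arithmetic. The function below is a hypothetical model of the documented behavior, not the datastore's code, and it assumes the artifact path is relative to the datastore root.

```python
from pathlib import PurePosixPath


def destination_path(
    destination: str, path_in_datastore: str, preserve_path: bool = True
) -> PurePosixPath:
    """Model where an artifact would be written under ``destination``.

    Assumes ``path_in_datastore`` is relative to the datastore root.
    """
    source = PurePosixPath(path_in_datastore)
    # Keep the whole relative path, or only its final file component.
    relative = source if preserve_path else PurePosixPath(source.name)
    return PurePosixPath(destination) / relative
```

With preserve_path=False, two artifacts that share a file name map to the same destination, which is one reason the overwrite flag matters.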

retrieve_artifacts_zip(refs: Iterable[DatasetRef], destination: str | ParseResult | ResourcePath | Path, overwrite: bool = True) → ResourcePath¶

Retrieve artifacts from the graph and place in ZIP file.

Parameters:

refsIterable [ DatasetRef ]: The datasets to be included in the zip file.
destinationlsst.resources.ResourcePathExpression: Directory to write the new ZIP file. This directory will also be used as a staging area for the datasets being downloaded from the datastore.
overwritebool, optional: If False the output Zip will not be written if a file of the same name is already present in destination.

Returns:

zip_filelsst.resources.ResourcePath: The path to the new ZIP file.

Raises:

ValueError: Raised if there are no refs to retrieve.

stored(ref: DatasetRef) → bool¶

Indicate whether the dataset’s artifacts are present in the Datastore.

Parameters:

refDatasetRef: Resolved reference to a dataset.

Returns:

storedbool: Whether the dataset artifact exists in the datastore and can be retrieved.

stored_many(refs: Iterable[DatasetRef]) → dict[lsst.daf.butler._dataset_ref.DatasetRef, bool]¶

Check the datastore for artifact existence of multiple datasets at once.

Parameters:

refsiterable of DatasetRef: The datasets to be checked.

Returns:

existencedict of [DatasetRef, bool]: Mapping from given dataset refs to boolean indicating artifact existence.
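The returned mapping is convenient for splitting refs into present and missing groups with a single datastore check. A minimal sketch, with `split_by_existence` as a hypothetical helper name:

```python
from typing import Any, Iterable, List, Tuple


def split_by_existence(
    butler: Any, refs: Iterable[Any]
) -> Tuple[List[Any], List[Any]]:
    """Return ``(present, missing)`` lists based on ``stored_many``.

    ``butler`` is expected to be a QuantumBackedButler and ``refs``
    resolved DatasetRef objects.
    """
    existence = butler.stored_many(refs)
    present = [ref for ref, exists in existence.items() if exists]
    missing = [ref for ref, exists in existence.items() if not exists]
    return present, missing
```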
