QuantumBackedButler¶
- class lsst.daf.butler.QuantumBackedButler(predicted_inputs: Iterable[UUID], predicted_outputs: Iterable[UUID], dimensions: DimensionUniverse, datastore: Datastore, storageClasses: StorageClassFactory, dataset_types: Mapping[str, DatasetType] | None = None)¶
Bases: LimitedButler
An implementation of LimitedButler intended to back execution of a single Quantum.
- Parameters:
- predicted_inputs
Iterable[DatasetId] Dataset IDs for datasets that can be read from this butler.
- predicted_outputs
Iterable[DatasetId] Dataset IDs for datasets that can be stored in this butler.
- dimensions
DimensionUniverse Object managing all dimension definitions.
- datastore
Datastore Datastore to use for all dataset I/O and existence checks.
- storageClasses
StorageClassFactory Object managing all storage class definitions.
- dataset_types
Mapping[str,DatasetType] The registry dataset type definitions, indexed by name.
Notes
Most callers should use the initialize classmethod to construct new instances instead of calling the constructor directly.
QuantumBackedButler uses a SQLite database internally, in order to reuse existing DatastoreRegistryBridge and OpaqueTableStorage implementations that rely on SQLAlchemy. If implementations are added in the future that don’t rely on SQLAlchemy, it should be possible to swap them in by overriding the type arguments to initialize (though at present, QuantumBackedButler would still create at least an in-memory SQLite database that would then go unused).
We imagine QuantumBackedButler being used during (at least) batch execution to capture Datastore records and save them to per-quantum files, which are also a convenient place to store provenance for eventual upload to a SQL-backed Registry (once Registry has tables to store provenance, that is). These per-quantum files can be written in two ways:
- The SQLite file used internally by QuantumBackedButler can be used directly by customizing the filename argument to initialize, and then transferring that file to the object store after execution completes (or fails; a try/finally pattern probably makes sense here).
- A JSON or YAML file can be written by calling extract_provenance_data, and using pydantic methods to write the returned QuantumProvenanceData to a file (see the sketch below).
Note that at present, the SQLite file only contains datastore records, not provenance, but that should be easy to address (if desired) after we actually design a Registry schema for provenance. I also suspect that we’ll want to explicitly close the SQLite file somehow before trying to transfer it. But I’m guessing we’d prefer to write the per-quantum files as JSON anyway.
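A minimal sketch of the JSON-based workflow described above, assuming a repository configuration (butler_config) and a Quantum with fully resolved DatasetRef instances (quantum) are already in hand; the exact pydantic serialization call depends on the pydantic version in use:

from lsst.daf.butler import DimensionUniverse, QuantumBackedButler

qbb = QuantumBackedButler.initialize(
    config=butler_config,   # repository root, config filename, or Config instance
    quantum=quantum,        # predicted inputs/outputs for this single quantum
    dimensions=DimensionUniverse(),
)
try:
    ...  # execute the task, reading via qbb.get() and writing via qbb.put()
finally:
    # Capture per-quantum provenance and datastore records.
    provenance = qbb.extract_provenance_data()
    # QuantumProvenanceData is a pydantic model; .json() is the pydantic v1
    # spelling (pydantic v2 uses .model_dump_json()).
    with open("quantum_provenance.json", "w") as stream:
        stream.write(provenance.json())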
Attributes Summary
dimensions
Structure managing all dimensions recognized by this data repository (DimensionUniverse).
Methods Summary
extract_provenance_data()
Extract provenance information and datastore records from this butler.
from_predicted(config, predicted_inputs, ...)
Construct a new QuantumBackedButler from sets of input and output dataset IDs.
get(ref, /, *[, parameters, storageClass])
Retrieve a stored dataset.
getDeferred(ref, /, *[, parameters, ...])
Create a DeferredDatasetHandle which can later retrieve a dataset, after an immediate registry lookup.
initialize(config, quantum, dimensions[, ...])
Construct a new QuantumBackedButler from repository configuration and helper types.
markInputUnused(ref)
Indicate that a predicted input was not actually used when processing a Quantum.
pruneDatasets(refs, *[, disassociate, ...])
Remove one or more datasets from a collection and/or storage.
put(obj, ref, /)
Store a dataset that already has a UUID and RUN collection.
retrieve_artifacts(refs, destination[, ...])
Retrieve the artifacts associated with the supplied refs.
retrieve_artifacts_zip(refs, destination[, ...])
Retrieve artifacts from the graph and place in ZIP file.
stored(ref)
Indicate whether the dataset's artifacts are present in the Datastore.
stored_many(refs)
Check the datastore for artifact existence of multiple datasets at once.
Attributes Documentation
- dimensions¶
Methods Documentation
- extract_provenance_data() QuantumProvenanceData¶
Extract provenance information and datastore records from this butler.
- Returns:
- provenance
QuantumProvenanceData A serializable struct containing input/output dataset IDs and datastore records. This assumes all dataset IDs are UUIDs (just to make it easier for pydantic to reason about the struct’s types); the rest of this class makes no such assumption, but the approach to processing in which it’s useful effectively requires UUIDs anyway.
Notes
QuantumBackedButler records this provenance information when its methods are used, which mostly saves PipelineTask authors from having to worry about it while still recording very detailed information. But it has two small weaknesses:
- Calling getDeferred or get is enough to mark a dataset as an “actual input”, which may mark some datasets that aren’t actually used. We rely on task authors to use markInputUnused to address this.
- We assume that the execution system will call stored on all predicted inputs prior to execution, in order to populate the “available inputs” set. This is what I envision SingleQuantumExecutor doing after we update it to use this class, but it feels fragile for this class to make such a strong assumption about how it will be used, even if I can’t think of any other executor behavior that would make sense.
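The second point above suggests a pre-execution step along these lines; predicted_input_refs is a hypothetical iterable of resolved DatasetRef objects for the quantum's predicted inputs:

# Probe every predicted input so the "available inputs" set is populated
# before the task runs; stored_many returns a dict of DatasetRef -> bool.
availability = qbb.stored_many(predicted_input_refs)
available_inputs = {ref for ref, exists in availability.items() if exists}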
- classmethod from_predicted(config: ~lsst.daf.butler._config.Config | str | ~urllib.parse.ParseResult | ~lsst.resources._resourcePath.ResourcePath | ~pathlib.Path, predicted_inputs: ~collections.abc.Iterable[~uuid.UUID], predicted_outputs: ~collections.abc.Iterable[~uuid.UUID], dimensions: ~lsst.daf.butler.dimensions._universe.DimensionUniverse, datastore_records: ~collections.abc.Mapping[str, ~lsst.daf.butler.datastore.record_data.DatastoreRecordData], filename: str = ':memory:', OpaqueManagerClass: type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] = <class 'lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager'>, BridgeManagerClass: type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] = <class 'lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager'>, search_paths: list[str] | None = None, dataset_types: ~collections.abc.Mapping[str, ~lsst.daf.butler._dataset_type.DatasetType] | None = None) QuantumBackedButler¶
Construct a new QuantumBackedButler from sets of input and output dataset IDs.
- Parameters:
- config
Config or ResourcePathExpression A butler repository root, configuration filename, or configuration instance.
- predicted_inputs
Iterable[DatasetId] Dataset IDs for datasets that can be read from this butler.
- predicted_outputs
Iterable[DatasetId] Dataset IDs for datasets that can be stored in this butler; these must be fully resolved.
- dimensions
DimensionUniverse Object managing all dimension definitions.
- datastore_records
dict[str,DatastoreRecordData] or None Datastore records to import into a datastore.
- filename
str, optional Name for the SQLite database that will back this butler; defaults to an in-memory database.
- OpaqueManagerClass
type, optional A subclass of OpaqueTableStorageManager to use for datastore opaque records. Default is a SQL-backed implementation.
- BridgeManagerClass
type, optional A subclass of DatastoreRegistryBridgeManager to use for datastore location records. Default is a SQL-backed implementation.
- search_paths
list of str, optional Additional search paths for butler configuration.
- dataset_types
Mapping[str,DatasetType], optional Mapping of the dataset type name to its registry definition.
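A minimal sketch of direct construction from predicted dataset IDs, assuming the caller has already assembled the inputs (for example from a serialized quantum graph); input_ids, output_ids, records, and types are hypothetical names:

from lsst.daf.butler import DimensionUniverse, QuantumBackedButler

qbb = QuantumBackedButler.from_predicted(
    config=butler_config,
    predicted_inputs=input_ids,      # iterable of uuid.UUID
    predicted_outputs=output_ids,    # iterable of uuid.UUID
    dimensions=DimensionUniverse(),
    datastore_records=records,       # mapping of datastore name -> DatastoreRecordData
    dataset_types=types,             # mapping of dataset type name -> DatasetType
)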
- get(ref: DatasetRef, /, *, parameters: dict[str, Any] | None = None, storageClass: StorageClass | str | None = None) Any¶
Retrieve a stored dataset.
- Parameters:
- ref
DatasetRef A resolved DatasetRef directly associated with a dataset.
- parameters
dict Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.
- storageClass
StorageClass or str, optional The storage class to be used to override the Python type returned by this method. By default the returned type matches the dataset type definition for this dataset. Specifying a read StorageClass can force a different type to be returned. This type must be compatible with the original type.
- Returns:
- obj
object The dataset.
- Raises:
- AmbiguousDatasetError
Raised if the supplied DatasetRef is unresolved.
Notes
In a LimitedButler the only allowable way to specify a dataset is to use a resolved DatasetRef. Subclasses can support more options.
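For example, assuming ref is a resolved reference to one of the quantum's predicted inputs; the "bbox" parameter and "ExposureF" storage class below are illustrative and must be valid for the dataset type actually being read:

# Read the full dataset.
obj = qbb.get(ref)

# Read a subset and override the returned Python type (illustrative values).
cutout = qbb.get(ref, parameters={"bbox": bbox}, storageClass="ExposureF")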
- getDeferred(ref: DatasetRef, /, *, parameters: dict[str, Any] | None = None, storageClass: str | StorageClass | None = None) DeferredDatasetHandle¶
Create a DeferredDatasetHandle which can later retrieve a dataset, after an immediate registry lookup.
- Parameters:
- ref
DatasetRef For the default implementation of a LimitedButler, the only acceptable parameter is a resolved DatasetRef.
- parameters
dict Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.
- storageClass
StorageClass or str, optional The storage class to be used to override the Python type returned by this method. By default the returned type matches the dataset type definition for this dataset. Specifying a read StorageClass can force a different type to be returned. This type must be compatible with the original type.
- Returns:
- obj
DeferredDatasetHandle A handle which can be used to retrieve a dataset at a later time.
Notes
In a LimitedButler the only allowable way to specify a dataset is to use a resolved DatasetRef. Subclasses can support more options.
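For example, assuming ref is a resolved reference; note that obtaining the handle already marks the dataset as an actual input (see markInputUnused):

handle = qbb.getDeferred(ref)
if need_the_data:  # hypothetical condition evaluated later
    obj = handle.get()  # the handle's get() performs the actual read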
- classmethod initialize(config: ~lsst.daf.butler._config.Config | str | ~urllib.parse.ParseResult | ~lsst.resources._resourcePath.ResourcePath | ~pathlib.Path, quantum: ~lsst.daf.butler._quantum.Quantum, dimensions: ~lsst.daf.butler.dimensions._universe.DimensionUniverse, filename: str = ':memory:', OpaqueManagerClass: type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] = <class 'lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager'>, BridgeManagerClass: type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] = <class 'lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager'>, search_paths: list[str] | None = None, dataset_types: ~collections.abc.Mapping[str, ~lsst.daf.butler._dataset_type.DatasetType] | None = None) QuantumBackedButler¶
Construct a new QuantumBackedButler from repository configuration and helper types.
- Parameters:
- config
Config or ResourcePathExpression A butler repository root, configuration filename, or configuration instance.
- quantum
Quantum Object describing the predicted input and output datasets relevant to this butler. This must have resolved DatasetRef instances for all inputs and outputs.
- dimensions
DimensionUniverse Object managing all dimension definitions.
- filename
str, optional Name for the SQLite database that will back this butler; defaults to an in-memory database.
- OpaqueManagerClass
type, optional A subclass of OpaqueTableStorageManager to use for datastore opaque records. Default is a SQL-backed implementation.
- BridgeManagerClass
type, optional A subclass of DatastoreRegistryBridgeManager to use for datastore location records. Default is a SQL-backed implementation.
- search_paths
list of str, optional Additional search paths for butler configuration.
- dataset_types
Mapping[str,DatasetType], optional Mapping of the dataset type name to its registry definition.
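A sketch of the SQLite-file variant described in the class Notes, backing the butler with an on-disk database whose file can be preserved after execution; the filename and transfer step are placeholders:

qbb = QuantumBackedButler.initialize(
    config=butler_config,
    quantum=quantum,
    dimensions=DimensionUniverse(),
    filename="quantum_records.sqlite3",  # hypothetical on-disk database
)
try:
    ...  # execute the task against qbb
finally:
    ...  # transfer quantum_records.sqlite3 alongside other per-quantum outputs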
- markInputUnused(ref: DatasetRef) None¶
Indicate that a predicted input was not actually used when processing a Quantum.
- Parameters:
- ref
DatasetRef Reference to the unused dataset.
Notes
By default, a dataset is considered “actually used” if it is accessed via get or a handle to it is obtained via getDeferred (even if the handle is not used). This method must be called after one of those in order to remove the dataset from the actual input list.
This method does nothing for butlers that do not store provenance information (which is the default implementation provided by the base class).
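For example:

# Obtaining a handle marks the dataset as an actual input even if it is
# never read; undo that explicitly when the input goes unused.
handle = qbb.getDeferred(ref)
if not need_the_data:  # hypothetical condition
    qbb.markInputUnused(ref)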
- pruneDatasets(refs: Iterable[DatasetRef], *, disassociate: bool = True, unstore: bool = False, tags: Iterable[str] = (), purge: bool = False) None¶
Remove one or more datasets from a collection and/or storage.
- Parameters:
- refs
Iterable of DatasetRef Datasets to prune. These must be “resolved” references (not just a DatasetType and data ID).
- disassociate
bool, optional Disassociate pruned datasets from tags, or from all collections if purge=True.
- unstore
bool, optional If True (False is default) remove these datasets from all datastores known to this butler. Note that this will make it impossible to retrieve these datasets even via other collections. Datasets that are already not stored are ignored by this option.
- tags
Iterable[str], optional TAGGED collections to disassociate the datasets from. Ignored if disassociate is False or purge is True.
- purge
bool, optional If True (False is default), completely remove the dataset from the Registry. To prevent accidental deletions, purge may only be True if all of the following conditions are met:
This mode may remove provenance information from datasets other than those provided, and should be used with extreme care.
- Raises:
- TypeError
Raised if the butler is read-only, if no collection was provided, or the conditions for purge=True were not met.
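A hedged sketch of the most likely use in a quantum-backed context, removing datastore artifacts only; failed_output_refs is a hypothetical iterable of resolved refs:

# Remove artifacts from the datastore without touching collections; the
# disassociate/purge options concern Registry collections and are left off.
qbb.pruneDatasets(failed_output_refs, disassociate=False, unstore=True)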
- put(obj: Any, ref: DatasetRef, /) DatasetRef¶
Store a dataset that already has a UUID and RUN collection.
- Parameters:
- obj
object The dataset.
- ref
DatasetRef Resolved reference for a not-yet-stored dataset.
- Returns:
- ref
DatasetRef The same as the given, for convenience and symmetry with Butler.put.
- Raises:
- TypeError
Raised if the butler is read-only.
Notes
Whether this method inserts the given dataset into a Registry is implementation defined (some LimitedButler subclasses do not have a Registry), but it always adds the dataset to a Datastore, and the given ref.id and ref.run are always preserved.
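For example, assuming output_ref is a resolved, not-yet-stored reference from the quantum's predicted outputs:

stored_ref = qbb.put(output_object, output_ref)
# The returned reference is the same as the one passed in.
assert stored_ref == output_ref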
- retrieve_artifacts(refs: Iterable[DatasetRef], destination: str | ParseResult | ResourcePath | Path, transfer: str = 'auto', preserve_path: bool = True, overwrite: bool = False) list[lsst.resources._resourcePath.ResourcePath]¶
Retrieve the artifacts associated with the supplied refs.
- Parameters:
- refs
iterable of DatasetRef The datasets for which artifacts are to be retrieved. A single ref can result in multiple artifacts. The refs must be resolved.
- destination
lsst.resources.ResourcePath or str Location to write the artifacts.
- transfer
str, optional Method to use to transfer the artifacts. Must be one of the options supported by transfer_from(). “move” is not allowed.
- preserve_path
bool, optional If True the full path of the artifact within the datastore is preserved. If False the final file component of the path is used.
- overwrite
bool, optional If True allow transfers to overwrite existing files at the destination.
- Returns:
- targets
list of lsst.resources.ResourcePath URIs of file artifacts in destination location. Order is not preserved.
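For example, copying artifacts to a local directory; the destination path is illustrative and "copy" is one of the standard transfer modes:

uris = qbb.retrieve_artifacts(
    refs,
    destination="/tmp/artifacts",  # any ResourcePath-compatible location
    transfer="copy",
    preserve_path=True,
)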
- retrieve_artifacts_zip(refs: Iterable[DatasetRef], destination: str | ParseResult | ResourcePath | Path, overwrite: bool = True) ResourcePath¶
Retrieve artifacts from the graph and place in ZIP file.
- Parameters:
- refs
Iterable[DatasetRef] The datasets to be included in the zip file.
- destination
lsst.resources.ResourcePathExpression Directory to write the new ZIP file. This directory will also be used as a staging area for the datasets being downloaded from the datastore.
- overwrite
bool, optional If False the output Zip will not be written if a file of the same name is already present in destination.
- Returns:
- zip_file
lsst.resources.ResourcePath The path to the new ZIP file.
- Raises:
- ValueError
Raised if there are no refs to retrieve.
- stored(ref: DatasetRef) bool¶
Indicate whether the dataset’s artifacts are present in the Datastore.
- Parameters:
- ref
DatasetRef Resolved reference to a dataset.
- Returns:
- stored
bool Whether the dataset artifact exists in the datastore and can be retrieved.
- stored_many(refs: Iterable[DatasetRef]) dict[lsst.daf.butler._dataset_ref.DatasetRef, bool]¶
Check the datastore for artifact existence of multiple datasets at once.
- Parameters:
- refs
iterable of DatasetRef The datasets to be checked.
- Returns:
- existence
dict of [DatasetRef, bool] Mapping from given dataset refs to boolean indicating artifact existence.