QuantumBackedButler

class lsst.daf.butler.QuantumBackedButler(quantum: lsst.daf.butler.core.quantum.Quantum, dimensions: lsst.daf.butler.core.dimensions._universe.DimensionUniverse, datastore: lsst.daf.butler.core.datastore.Datastore, storageClasses: lsst.daf.butler.core.storageClass.StorageClassFactory)

Bases: lsst.daf.butler._limited_butler.LimitedButler

An implementation of LimitedButler intended to back execution of a single Quantum.

Parameters:
quantum : Quantum

Object describing the predicted input and output datasets relevant to this butler. This must have resolved DatasetRef instances for all inputs and outputs.

dimensions : DimensionUniverse

Object managing all dimension definitions.

datastore : Datastore

Datastore to use for all dataset I/O and existence checks.

storageClasses : StorageClassFactory

Object managing all storage class definitions.

Notes

Most callers should use the initialize classmethod to construct new instances instead of calling the constructor directly.

QuantumBackedButler uses a SQLite database internally, in order to reuse existing DatastoreRegistryBridge and OpaqueTableStorage implementations that rely on SQLAlchemy. If implementations are added in the future that don’t rely on SQLAlchemy, it should be possible to swap them in by overriding the type arguments to initialize (though at present, QuantumBackedButler would still create at least an in-memory SQLite database that would then go unused).

We imagine QuantumBackedButler being used during (at least) batch execution to capture Datastore records and save them to per-quantum files, which are also a convenient place to store provenance for eventual upload to a SQL-backed Registry (once Registry has tables to store provenance, that is). These per-quantum files can be written in two ways:

  • The SQLite file used internally by QuantumBackedButler can be used directly by customizing the filename argument to initialize, and then transferring that file to the object store after execution completes (or fails; a try/finally pattern probably makes sense here).
  • A JSON or YAML file can be written by calling extract_provenance_data, and using pydantic methods to write the returned QuantumProvenanceData to a file.

Note that at present, the SQLite file only contains datastore records, not provenance, but that should be easy to address (if desired) after we actually design a Registry schema for provenance. I also suspect that we’ll want to explicitly close the SQLite file somehow before trying to transfer it. But I’m guessing we’d prefer to write the per-quantum files as JSON anyway.
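
The two approaches above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's own code: make_butler and execute are hypothetical callables supplied by the caller (in real use make_butler might wrap QuantumBackedButler.initialize with a custom filename), a local file copy stands in for the transfer to an object store, and the json() call assumes the pydantic v1-style serializer is available on QuantumProvenanceData.

```python
import shutil
from pathlib import Path
from typing import Any, Callable


def run_and_keep_sqlite(
    make_butler: Callable[[str], Any],
    execute: Callable[[Any], None],
    db_path: Path,
    destination: Path,
) -> None:
    """Run one quantum and copy its SQLite file even if execution fails.

    ``make_butler`` is a hypothetical factory that receives the database
    filename; ``execute`` is a hypothetical callable that runs the quantum.
    """
    butler = make_butler(str(db_path))
    try:
        execute(butler)
    finally:
        # Stand-in for the transfer to an object store. Any step needed to
        # close the database before the transfer is not shown here.
        if db_path.exists():
            shutil.copy2(db_path, destination)


def write_provenance_json(butler: Any, path: Path) -> None:
    """Write the provenance extracted from ``butler`` to ``path`` as JSON."""
    provenance = butler.extract_provenance_data()
    # Assumption: the returned struct offers the pydantic v1-style ``json()``.
    path.write_text(provenance.json())
```

Because the copy lives in the finally clause, the per-quantum file is preserved whether execute returns normally or raises.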

Attributes Summary

GENERATION
dimensions Structure managing all dimensions recognized by this data repository (DimensionUniverse).

Methods Summary

datasetExistsDirect(ref) Return True if a dataset is actually present in the Datastore.
extract_provenance_data() Extract provenance information and datastore records from this butler.
getDirect(ref, *, parameters) Retrieve a stored dataset.
getDirectDeferred(ref, *, parameters) Create a DeferredDatasetHandle which can later retrieve a dataset, from a resolved DatasetRef.
initialize(config, quantum, …) Construct a new QuantumBackedButler from repository configuration and helper types.
isWriteable() Return True if this Butler supports write operations.
markInputUnused(ref) Indicate that a predicted input was not actually used when processing a Quantum.
putDirect(obj, ref) Store a dataset that already has a UUID and RUN collection.

Attributes Documentation

GENERATION = 3
dimensions

Structure managing all dimensions recognized by this data repository (DimensionUniverse).

Methods Documentation

datasetExistsDirect(ref: lsst.daf.butler.core.datasets.ref.DatasetRef) → bool

Return True if a dataset is actually present in the Datastore.

Parameters:
ref : DatasetRef

Resolved reference to a dataset.

Returns:
exists : bool

Whether the dataset exists in the Datastore.

extract_provenance_data() → lsst.daf.butler._quantum_backed.QuantumProvenanceData

Extract provenance information and datastore records from this butler.

Returns:
provenance : QuantumProvenanceData

A serializable struct containing input/output dataset IDs and datastore records. This assumes all dataset IDs are UUIDs (just to make it easier for pydantic to reason about the struct’s types); the rest of this class makes no such assumption, but the approach to processing in which it’s useful effectively requires UUIDs anyway.

Notes

QuantumBackedButler records this provenance information when its methods are used, which mostly saves PipelineTask authors from having to worry about it while still recording very detailed information. But it has two small weaknesses:

  • Calling getDirectDeferred or getDirect is enough to mark a dataset as an “actual input”, which may mark some datasets that aren’t actually used. We rely on task authors to use markInputUnused to address this.
  • We assume that the execution system will call datasetExistsDirect on all predicted inputs prior to execution, in order to populate the “available inputs” set. This is what I envision SingleQuantumExecutor doing after we update it to use this class, but it feels fragile for this class to make such a strong assumption about how it will be used, even if I can’t think of any other executor behavior that would make sense.
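
The existence check that the second point expects from an executor might look like the sketch below. It is illustrative only: find_available_inputs is a hypothetical helper, the butler argument is any object that provides datasetExistsDirect, and how the predicted input references are gathered from the Quantum is left to the caller.

```python
from typing import Any, Iterable, List


def find_available_inputs(butler: Any, predicted_inputs: Iterable[Any]) -> List[Any]:
    """Return the predicted inputs that the butler reports as existing.

    Every reference is checked, so a provenance-recording butler sees each
    predicted input once before execution starts.
    """
    available = []
    for ref in predicted_inputs:
        if butler.datasetExistsDirect(ref):
            available.append(ref)
    return available
```

An executor could compare the returned list against the full set of predicted inputs to decide whether the quantum can run at all.
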
getDirect(ref: lsst.daf.butler.core.datasets.ref.DatasetRef, *, parameters: Optional[Dict[str, Any]] = None) → Any

Retrieve a stored dataset.

Unlike Butler.get, this method allows datasets outside the Butler’s collection to be read as long as the DatasetRef that identifies them can be obtained separately.

Parameters:
ref : DatasetRef

Resolved reference to an already stored dataset.

parameters : dict

Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.

Returns:
obj : object

The dataset.

Raises:
AmbiguousDatasetError

Raised if ref.id is None, i.e. the reference is unresolved.

getDirectDeferred(ref: lsst.daf.butler.core.datasets.ref.DatasetRef, *, parameters: Optional[dict] = None) → lsst.daf.butler._deferredDatasetHandle.DeferredDatasetHandle

Create a DeferredDatasetHandle which can later retrieve a dataset, from a resolved DatasetRef.

Parameters:
ref : DatasetRef

Resolved reference to an already stored dataset.

parameters : dict

Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.

Returns:
obj : DeferredDatasetHandle

A handle which can be used to retrieve a dataset at a later time.

Raises:
AmbiguousDatasetError

Raised if ref.id is None, i.e. the reference is unresolved.
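
A possible usage pattern for deferred handles is sketched below, assuming only that the butler provides getDirectDeferred and that each returned handle has a get method. The helper names open_deferred and first_matching are hypothetical.

```python
from typing import Any, Callable, Iterable, List, Optional


def open_deferred(
    butler: Any, refs: Iterable[Any], parameters: Optional[dict] = None
) -> List[Any]:
    """Create one deferred handle per reference without reading any data."""
    return [butler.getDirectDeferred(ref, parameters=parameters) for ref in refs]


def first_matching(handles: Iterable[Any], predicate: Callable[[Any], bool]) -> Any:
    """Load datasets one at a time and return the first that satisfies predicate."""
    for handle in handles:
        obj = handle.get()  # the actual read happens here
        if predicate(obj):
            return obj
    return None
```

Creating a handle defers the read but not the provenance bookkeeping: obtaining the handle is already enough for the dataset to count as an actual input.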

classmethod initialize(config: Union[lsst.daf.butler.core.config.Config, str], quantum: lsst.daf.butler.core.quantum.Quantum, dimensions: lsst.daf.butler.core.dimensions._universe.DimensionUniverse, filename: str = ':memory:', OpaqueManagerClass: Type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] = <class 'lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager'>, BridgeManagerClass: Type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] = <class 'lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager'>, search_paths: Optional[List[str]] = None) → lsst.daf.butler._quantum_backed.QuantumBackedButler

Construct a new QuantumBackedButler from repository configuration and helper types.

Parameters:
config : Config or str

A butler repository root, configuration filename, or configuration instance.

quantum : Quantum

Object describing the predicted input and output datasets relevant to this butler. This must have resolved DatasetRef instances for all inputs and outputs.

dimensions : DimensionUniverse

Object managing all dimension definitions.

filename : str, optional

Name for the SQLite database that will back this butler; defaults to an in-memory database.

OpaqueManagerClass : type, optional

A subclass of OpaqueTableStorageManager to use for datastore opaque records. Default is a SQL-backed implementation.

BridgeManagerClass : type, optional

A subclass of DatastoreRegistryBridgeManager to use for datastore location records. Default is a SQL-backed implementation.

search_paths : list of str, optional

Additional search paths for butler configuration.

isWriteable() → bool

Return True if this Butler supports write operations.

markInputUnused(ref: lsst.daf.butler.core.datasets.ref.DatasetRef) → None

Indicate that a predicted input was not actually used when processing a Quantum.

Parameters:
ref : DatasetRef

Reference to the unused dataset.

Notes

By default, a dataset is considered “actually used” if it is accessed via getDirect or a handle to it is obtained via getDirectDeferred (even if the handle is not used). This method must be called after one of those in order to remove the dataset from the actual input list.

This method does nothing for butlers that do not store provenance information (which is the default implementation provided by the base class).
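
The ordering described above (read first, then mark as unused) can be sketched as follows. read_usable_inputs and is_usable are hypothetical names used for illustration; the butler argument is any object that provides getDirect and markInputUnused.

```python
from typing import Any, Callable, Iterable, List


def read_usable_inputs(
    butler: Any, refs: Iterable[Any], is_usable: Callable[[Any], bool]
) -> List[Any]:
    """Read every input and keep only those that pass ``is_usable``.

    Rejected datasets are reported with markInputUnused after the read,
    so they do not remain recorded as actual inputs.
    """
    kept = []
    for ref in refs:
        obj = butler.getDirect(ref)  # the read marks ref as an actual input
        if is_usable(obj):
            kept.append(obj)
        else:
            butler.markInputUnused(ref)  # undo the marking for rejected data
    return kept
```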

putDirect(obj: Any, ref: lsst.daf.butler.core.datasets.ref.DatasetRef) → lsst.daf.butler.core.datasets.ref.DatasetRef

Store a dataset that already has a UUID and RUN collection.

Parameters:
obj : object

The dataset.

ref : DatasetRef

Resolved reference for a not-yet-stored dataset.

Returns:
ref : DatasetRef

The same reference that was given, returned for convenience and symmetry with Butler.put.

Raises:
TypeError

Raised if the butler is read-only.

AmbiguousDatasetError

Raised if ref.id is None, i.e. the reference is unresolved.

Notes

Whether this method inserts the given dataset into a Registry is implementation defined (some LimitedButler subclasses do not have a Registry), but it always adds the dataset to a Datastore, and the given ref.id and ref.run are always preserved.