QuantumBackedButler
class lsst.daf.butler.QuantumBackedButler(predicted_inputs: Iterable[Union[int, uuid.UUID]], predicted_outputs: Iterable[Union[int, uuid.UUID]], dimensions: lsst.daf.butler.core.dimensions._universe.DimensionUniverse, datastore: lsst.daf.butler.core.datastore.Datastore, storageClasses: lsst.daf.butler.core.storageClass.StorageClassFactory)
Bases: lsst.daf.butler.LimitedButler
An implementation of
LimitedButler
intended to back execution of a single
Quantum
.
Parameters: - predicted_inputs :
Iterable
[DatasetId
] Dataset IDs for datasets that can be read from this butler.
- predicted_outputs :
Iterable
[DatasetId
] Dataset IDs for datasets that can be stored in this butler.
- dimensions :
DimensionUniverse
Object managing all dimension definitions.
- datastore :
Datastore
Datastore to use for all dataset I/O and existence checks.
- storageClasses :
StorageClassFactory
Object managing all storage class definitions.
Notes
Most callers should use the
initialize
classmethod
to construct new instances instead of calling the constructor directly.
QuantumBackedButler
uses a SQLite database internally, in order to reuse existing
DatastoreRegistryBridge
and
OpaqueTableStorage
implementations that rely on SQLAlchemy. If implementations are added in the future that don’t rely on SQLAlchemy, it should be possible to swap them in by overriding the type arguments to
initialize
(though at present,
QuantumBackedButler
would still create at least an in-memory SQLite database that would then go unused).
We imagine
QuantumBackedButler
being used during (at least) batch execution to capture
Datastore
records and save them to per-quantum files, which are also a convenient place to store provenance for eventual upload to a SQL-backed
Registry
(once
Registry
has tables to store provenance, that is). These per-quantum files can be written in two ways:
- The SQLite file used internally by
QuantumBackedButler
can be used directly by customizing the
filename
argument to
initialize
, and then transferring that file to the object store after execution completes (or fails; a
try/finally
pattern probably makes sense here).
- A JSON or YAML file can be written by calling
extract_provenance_data
, and using
pydantic
methods to write the returned
QuantumProvenanceData
to a file.
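The first option above (keep the SQLite file, then transfer it whether execution succeeds or fails) might look roughly like the sketch below. It uses only the standard library and does not call lsst.daf.butler: a plain sqlite3 connection stands in for the database that QuantumBackedButler manages internally, a local directory copy stands in for the object-store transfer, and run_with_sqlite_capture, work, and archive_dir are hypothetical names introduced for illustration.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def run_with_sqlite_capture(work, db_path: Path, archive_dir: Path) -> Path:
    """Run ``work(connection)`` and archive the SQLite file even if it fails.

    Hypothetical helper: ``work`` stands in for executing one quantum.
    """
    connection = sqlite3.connect(db_path)
    try:
        work(connection)
        connection.commit()
    finally:
        # Close first so everything is flushed before the file is copied.
        connection.close()
        archive_dir.mkdir(parents=True, exist_ok=True)
        target = archive_dir / db_path.name
        # Stand-in for "transfer to the object store".
        shutil.copy2(db_path, target)
    return target


def record_one_dataset(connection):
    # Toy schema; the real datastore record tables are not modelled here.
    connection.execute("CREATE TABLE records (dataset_id TEXT, path TEXT)")
    connection.execute("INSERT INTO records VALUES (?, ?)", ("abc", "a/b.fits"))


workdir = Path(tempfile.mkdtemp())
archived = run_with_sqlite_capture(
    record_one_dataset, workdir / "quantum.sqlite3", workdir / "store"
)
check = sqlite3.connect(archived)
rows = check.execute("SELECT dataset_id, path FROM records").fetchall()
check.close()
```

Closing the connection inside the finally block, before the copy, reflects the concern raised in the notes about closing the SQLite file before transferring it.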
Note that at present, the SQLite file only contains datastore records, not provenance, but that should be easy to address (if desired) after we actually design a
Registry
schema for provenance. I also suspect that we’ll want to explicitly close the SQLite file somehow before trying to transfer it. But I’m guessing we’d prefer to write the per-quantum files as JSON anyway.
Attributes Summary
GENERATION
dimensions
Structure managing all dimensions recognized by this data repository (DimensionUniverse).
Methods Summary
datasetExistsDirect(ref)
Return True if a dataset is actually present in the Datastore.
extract_provenance_data()
Extract provenance information and datastore records from this butler.
from_predicted(config, …)
Construct a new QuantumBackedButler from sets of input and output dataset IDs.
getDirect(ref, *, parameters, …)
Retrieve a stored dataset.
getDirectDeferred(ref, *, parameters, …)
Create a DeferredDatasetHandle which can later retrieve a dataset, from a resolved DatasetRef.
initialize(config, quantum, …)
Construct a new QuantumBackedButler from repository configuration and helper types.
isWriteable()
Return True if this Butler supports write operations.
markInputUnused(ref)
Indicate that a predicted input was not actually used when processing a Quantum.
pruneDatasets(refs, *, disassociate, …)
Remove one or more datasets from a collection and/or storage.
putDirect(obj, ref)
Store a dataset that already has a UUID and RUN collection.
Attributes Documentation
-
GENERATION
= 3¶
-
dimensions
¶ Structure managing all dimensions recognized by this data repository (
DimensionUniverse
).
Methods Documentation
-
datasetExistsDirect
(ref: lsst.daf.butler.core.datasets.ref.DatasetRef) → bool¶ Return
True
if a dataset is actually present in the Datastore.
Parameters: - ref :
DatasetRef
Resolved reference to a dataset.
Returns: - exists :
bool
Whether the dataset exists in the Datastore.
-
extract_provenance_data
() → lsst.daf.butler._quantum_backed.QuantumProvenanceData¶ Extract provenance information and datastore records from this butler.
Returns: - provenance :
QuantumProvenanceData
A serializable struct containing input/output dataset IDs and datastore records. This assumes all dataset IDs are UUIDs (just to make it easier for
pydantic
to reason about the struct’s types); the rest of this class makes no such assumption, but the approach to processing in which it’s useful effectively requires UUIDs anyway.
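To illustrate why restricting the struct to UUIDs keeps serialization simple, here is a minimal sketch that round-trips sets of UUIDs through JSON. It deliberately uses a standard-library dataclass rather than pydantic or the real QuantumProvenanceData; the class name ProvenanceSketch, its field names, and the to_json/from_json helpers are assumptions made for illustration only.

```python
from __future__ import annotations

import json
import uuid
from dataclasses import dataclass, field, fields


@dataclass
class ProvenanceSketch:
    """Hypothetical stand-in for a per-quantum provenance struct."""

    predicted_inputs: set[uuid.UUID] = field(default_factory=set)
    actual_inputs: set[uuid.UUID] = field(default_factory=set)
    predicted_outputs: set[uuid.UUID] = field(default_factory=set)

    def to_json(self) -> str:
        # Sets and UUIDs are not JSON types, so emit sorted lists of strings.
        payload = {
            f.name: sorted(str(i) for i in getattr(self, f.name))
            for f in fields(self)
        }
        return json.dumps(payload, indent=2)

    @classmethod
    def from_json(cls, text: str) -> ProvenanceSketch:
        raw = json.loads(text)
        # Every ID has one known type, so parsing back is unambiguous.
        return cls(**{name: {uuid.UUID(i) for i in ids} for name, ids in raw.items()})


first, second, third = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
original = ProvenanceSketch(
    predicted_inputs={first, second},
    actual_inputs={first},
    predicted_outputs={third},
)
restored = ProvenanceSketch.from_json(original.to_json())
```

If IDs could be either integers or UUIDs, the reader of the file would need extra information to decide how to parse each entry, which is the complication a UUID-only struct avoids.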
Notes
QuantumBackedButler
records this provenance information when its methods are used, which mostly saves
PipelineTask
authors from having to worry about it while still recording very detailed information. But it has two small weaknesses:
- Calling
getDirectDeferred
or
getDirect
is enough to mark a dataset as an “actual input”, which may mark some datasets that aren’t actually used. We rely on task authors to use
markInputUnused
to address this.
- We assume that the execution system will call
datasetExistsDirect
on all predicted inputs prior to execution, in order to populate the “available inputs” set. This is what I envision
SingleQuantumExecutor
doing after we update it to use this class, but it feels fragile for this class to make such a strong assumption about how it will be used, even if I can’t think of any other executor behavior that would make sense.
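The bookkeeping described in these notes can be modelled with three sets. The class below is a toy model written for this page, not the QuantumBackedButler implementation: InputTracker and its method names are hypothetical, and each method only mimics the side effect that the notes attribute to the corresponding butler call.

```python
class InputTracker:
    """Toy model of predicted / available / actual input tracking."""

    def __init__(self, predicted_inputs):
        self.predicted = set(predicted_inputs)
        self.available = set()
        self.actual = set()

    def check_exists(self, dataset_id, present):
        # Plays the role of datasetExistsDirect: only IDs that were
        # checked and found can end up in the "available" set.
        if present:
            self.available.add(dataset_id)
        return present

    def read(self, dataset_id):
        # Plays the role of getDirect / getDirectDeferred: any access
        # counts as use, even if the data is later ignored.
        self.actual.add(dataset_id)

    def mark_unused(self, dataset_id):
        # Plays the role of markInputUnused: undo an earlier read.
        self.actual.discard(dataset_id)


tracker = InputTracker(predicted_inputs={"a", "b", "c"})

# The executor is assumed to check every predicted input up front.
for dataset_id in sorted(tracker.predicted):
    tracker.check_exists(dataset_id, present=(dataset_id != "c"))

tracker.read("a")
tracker.read("b")
tracker.mark_unused("b")  # the task decided "b" was not needed after all
```

If the existence checks in the loop were skipped, the available set would stay empty even though reads succeeded, which is the fragility the second bullet describes.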
-
classmethod
from_predicted
(config: Union[lsst.daf.butler.core.config.Config, str], predicted_inputs: Iterable[Union[int, uuid.UUID]], predicted_outputs: Iterable[Union[int, uuid.UUID]], dimensions: lsst.daf.butler.core.dimensions._universe.DimensionUniverse, datastore_records: Mapping[str, lsst.daf.butler.core.datastoreRecordData.DatastoreRecordData], filename: str = ':memory:', OpaqueManagerClass: Type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] = <class 'lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager'>, BridgeManagerClass: Type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] = <class 'lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager'>, search_paths: Optional[List[str], None] = None) → lsst.daf.butler._quantum_backed.QuantumBackedButler¶ Construct a new
QuantumBackedButler
from sets of input and output dataset IDs.
Parameters: - config :
Config
or str
A butler repository root, configuration filename, or configuration instance.
- predicted_inputs :
Iterable
[DatasetId
] Dataset IDs for datasets that can be read from this butler.
- predicted_outputs :
Iterable
[DatasetId
] Dataset IDs for datasets that can be stored in this butler, must be fully resolved.
- dimensions :
DimensionUniverse
Object managing all dimension definitions.
- filename :
str
, optional Name for the SQLite database that will back this butler; defaults to an in-memory database.
- datastore_records :
dict
[str
,DatastoreRecordData
] or None
Datastore records to import into a datastore.
- OpaqueManagerClass :
type
, optional A subclass of
OpaqueTableStorageManager
to use for datastore opaque records. Default is a SQL-backed implementation.
- BridgeManagerClass :
type
, optional A subclass of
DatastoreRegistryBridgeManager
to use for datastore location records. Default is a SQL-backed implementation.
- search_paths :
list
of str
, optional Additional search paths for butler configuration.
-
getDirect
(ref: lsst.daf.butler.core.datasets.ref.DatasetRef, *, parameters: Optional[Dict[str, Any], None] = None, storageClass: Union[lsst.daf.butler.core.storageClass.StorageClass, str, None] = None) → Any¶ Retrieve a stored dataset.
Unlike
Butler.get
, this method allows datasets outside the Butler’s collection to be read as long as the
DatasetRef
that identifies them can be obtained separately.
Parameters: - ref :
DatasetRef
Resolved reference to an already stored dataset.
- parameters :
dict
Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.
- storageClass :
StorageClass
or str
, optional The storage class to be used to override the Python type returned by this method. By default the returned type matches the dataset type definition for this dataset. Specifying a read
StorageClass
can force a different type to be returned. This type must be compatible with the original type.
Returns: - obj :
object
The dataset.
Raises: - AmbiguousDatasetError
Raised if
ref.id is None
, i.e. the reference is unresolved.
-
getDirectDeferred
(ref: lsst.daf.butler.core.datasets.ref.DatasetRef, *, parameters: Optional[dict, None] = None, storageClass: Union[lsst.daf.butler.core.storageClass.StorageClass, str, None] = None) → lsst.daf.butler._deferredDatasetHandle.DeferredDatasetHandle¶ Create a
DeferredDatasetHandle
which can later retrieve a dataset, from a resolved
DatasetRef
.
Parameters: - ref :
DatasetRef
Resolved reference to an already stored dataset.
- parameters :
dict
Additional StorageClass-defined options to control reading, typically used to efficiently read only a subset of the dataset.
- storageClass :
StorageClass
or str
, optional The storage class to be used to override the Python type returned by this method. By default the returned type matches the dataset type definition for this dataset. Specifying a read
StorageClass
can force a different type to be returned. This type must be compatible with the original type.
Returns: - obj :
DeferredDatasetHandle
A handle which can be used to retrieve a dataset at a later time.
Raises: - AmbiguousDatasetError
Raised if
ref.id is None
, i.e. the reference is unresolved.
-
classmethod
initialize
(config: Union[lsst.daf.butler.core.config.Config, str], quantum: lsst.daf.butler.core.quantum.Quantum, dimensions: lsst.daf.butler.core.dimensions._universe.DimensionUniverse, filename: str = ':memory:', OpaqueManagerClass: Type[lsst.daf.butler.registry.interfaces._opaque.OpaqueTableStorageManager] = <class 'lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager'>, BridgeManagerClass: Type[lsst.daf.butler.registry.interfaces._bridge.DatastoreRegistryBridgeManager] = <class 'lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager'>, search_paths: Optional[List[str], None] = None) → lsst.daf.butler._quantum_backed.QuantumBackedButler¶ Construct a new
QuantumBackedButler
from repository configuration and helper types.
Parameters: - config :
Config
or str
A butler repository root, configuration filename, or configuration instance.
- quantum :
Quantum
Object describing the predicted input and output datasets relevant to this butler. This must have resolved
DatasetRef
instances for all inputs and outputs.
- dimensions :
DimensionUniverse
Object managing all dimension definitions.
- filename :
str
, optional Name for the SQLite database that will back this butler; defaults to an in-memory database.
- OpaqueManagerClass :
type
, optional A subclass of
OpaqueTableStorageManager
to use for datastore opaque records. Default is a SQL-backed implementation.
- BridgeManagerClass :
type
, optional A subclass of
DatastoreRegistryBridgeManager
to use for datastore location records. Default is a SQL-backed implementation.
- search_paths :
list
of str
, optional Additional search paths for butler configuration.
-
markInputUnused
(ref: lsst.daf.butler.core.datasets.ref.DatasetRef) → None¶ Indicate that a predicted input was not actually used when processing a
Quantum
.
Parameters: - ref :
DatasetRef
Reference to the unused dataset.
Notes
By default, a dataset is considered “actually used” if it is accessed via
getDirect
or a handle to it is obtained via
getDirectDeferred
(even if the handle is not used). This method must be called after one of those in order to remove the dataset from the actual input list.
This method does nothing for butlers that do not store provenance information (which is the default implementation provided by the base class).
-
pruneDatasets
(refs: Iterable[lsst.daf.butler.core.datasets.ref.DatasetRef], *, disassociate: bool = True, unstore: bool = False, tags: Iterable[str] = (), purge: bool = False) → None¶ Remove one or more datasets from a collection and/or storage.
Parameters: - refs :
Iterable
of DatasetRef
Datasets to prune. These must be “resolved” references (not just a
DatasetType
and data ID).
- disassociate :
bool
, optional Disassociate pruned datasets from
tags
, or from all collections if purge=True.
- unstore :
bool
, optional If
True
(False
is default) remove these datasets from all datastores known to this butler. Note that this will make it impossible to retrieve these datasets even via other collections. Datasets that are already not stored are ignored by this option.
- tags :
Iterable
[str
], optional
TAGGED
collections to disassociate the datasets from. Ignored if disassociate is False or purge is True.
- purge :
bool
, optional If
True
(False
is default), completely remove the dataset from the Registry. To prevent accidental deletions, purge may only be True if all of the following conditions are met:
This mode may remove provenance information from datasets other than those provided, and should be used with extreme care.
Raises: - TypeError
Raised if the butler is read-only, if no collection was provided, or the conditions for
purge=True
were not met.
-
putDirect
(obj: Any, ref: lsst.daf.butler.core.datasets.ref.DatasetRef) → lsst.daf.butler.core.datasets.ref.DatasetRef¶ Store a dataset that already has a UUID and
RUN
collection.
Parameters: - obj :
object
The dataset.
- ref :
DatasetRef
Resolved reference for a not-yet-stored dataset.
Returns: - ref :
DatasetRef
The same as the given, for convenience and symmetry with
Butler.put
.
Raises: - TypeError
Raised if the butler is read-only.
- AmbiguousDatasetError
Raised if
ref.id is None
, i.e. the reference is unresolved.
Notes
Whether this method inserts the given dataset into a
Registry
is implementation defined (some LimitedButler subclasses do not have a Registry), but it always adds the dataset to a Datastore, and the given ref.id and ref.run are always preserved.