Storage Classes, Storage Class Delegates, and Formatters

Formatters and storage class delegates provide the interface between Butler and the Python types it is storing and retrieving. A Formatter is responsible for serializing a Python type to an external storage system and reading that serialized form back into Python. The serialization can be stored to a local file system, cloud storage, or even a database. A formatter can be globally configured to use particular parameters on write. On retrieval, read parameters can be supplied that, for example, return only a subset of the data.

Storage class delegates are used to disassemble and reassemble composite datasets and can also be used to process read parameters that adjust how the retrieved dataset might be modified on get.

Deciding which formatter or delegate to use is controlled by the storage class and corresponding dataset type.

Note

When discussing configuration below, the default configuration values can be inspected at $DAF_BUTLER_DIR/python/lsst/daf/butler/configs (they can be accessed directly as Python package resources) and current values can be obtained by calling butler config-dump on a Butler repository. For example, this will list the formatter section of a butler configuration (assuming a single file-based datastore is in use):

butler config-dump --subset .datastore.formatters ./repo-dir

Storage Classes

A Storage Class is fundamental to informing Butler how to deal with specific Python types. Each DatasetType is associated with a StorageClass when it is defined (usually as part of a pipeline configuration). This storage class declares the Python type, any components it may have (derived or read-write), and a delegate that can be used to process read parameters and do assembly.

Composites

A composite storage class declares that the Python type consists of discrete components that can be accessed individually. Each of these components must declare its storage class as well.

For example, if a pvi dataset type has been associated with an ExposureF composite storage class, the user can Butler.get() the full pvi and access the components as they would normally for an ExposureF, or, if the user only wants the metadata header from the exposure, they can ask for pvi.metadata and get just that. How that metadata is retrieved depends on how the dataset was serialized within the datastore.

Composites must declare all the components of the Python type. Butler requires that disassembling a composite into its components and then reassembling them into the composite is a lossless operation.

Only composites have the potential to be disassembled into discrete file artifacts by datastore. Disassembly itself is controlled by the datastore configuration. In cases where the datastore uses remote storage and only some components are required, disassembly can significantly improve data access performance.

Derived Components

There are some situations where a Python type has some property that can usefully be retrieved that looks like a component of a composite but is not a component since it is not an integral part of the composite Python type. This is particularly true for metadata such as bounding boxes or pixel counts which can efficiently be obtained from a large dataset without requiring that large dataset to be read into memory solely to be discarded once this information is extracted.

Derived components are read-only components that can be defined for all storage classes without that storage class being declared to be a composite. As for standard components, a derived component declares the storage class (and therefore Python type) of that component. If your Python type has useful components that can be accessed but which do not support full disassembly (because round-tripping disassembly is lossy), derived components should be defined rather than full-fledged components.

Read Parameters

A storage class definition includes read parameters that can control how a particular storage class is modified by Butler.get() before being returned to the caller. These read parameters can be thought of as being understood by the Python type being returned and modifying it in a way that still returns that identical Python type. The canonical example of this is subsetting where the caller passes in a bounding box that reduces the size of the image being returned.

If a parameter would change the Python type, its functionality should be encapsulated in a derived component.

Note

Parameters that control how a dataset is serialized into an artifact within the datastore cannot be supplied by user code calling Butler.put(). This is because a user storing a dataset does not know which formatter the datastore has been configured to use, or even whether the dataset will be persisted at all. For example, the caller cannot know whether a particular compression option is supported because they have no visibility into whether the file written is HDF5, FITS, or plain text. For this reason write parameters are part of formatter configuration.

Read Parameters and Derived Components

Read parameters are usually applied during the retrieval of the associated Python type from the persisted artifact. This requires that the parameters are understood by that Python type. When derived components involve read parameters there are now multiple ways in which the parameters can be applied.

Consider the case of a pixel counting derived component for an image dataset. A read parameter for subsetting the dataset should be applied to the image before the pixel counting is performed. It does not make sense for subsetting to be applied to the integer pixel count.

For this reason, read parameters for derived components are applied to the parent dataset before the derived component is calculated.
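The ordering described above can be sketched with plain Python lists standing in for an image. This is only a toy illustration: the helper names (apply_subset, count_pixels, get_npixels) and the bounding-box convention are assumptions made for this example and are not part of the Butler API.

```python
def apply_subset(image, bbox):
    """Cut a 2-D list down to bbox = (x0, y0, x1, y1), end-exclusive (assumed convention)."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]


def count_pixels(image):
    """Derived quantity: total number of pixels in the image."""
    return sum(len(row) for row in image)


def get_npixels(image, parameters=None):
    """Apply read parameters to the parent image first, then derive the count."""
    if parameters and "subset" in parameters:
        image = apply_subset(image, parameters["subset"])
    return count_pixels(image)


image = [[0] * 4 for _ in range(3)]  # 4 pixels wide, 3 pixels tall
full_count = get_npixels(image)
subset_count = get_npixels(image, {"subset": (1, 0, 3, 2)})
```

Here the subset is applied to the image, and the integer count is computed from the already subsetted image; the parameter is never applied to the integer itself.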

Defining a Storage Class

Storage classes are defined in the storageClasses section of the Butler configuration YAML file. A very simple declaration of a storage class with an associated Python type looks like:

NumPixels:
  pytype: int

This declares that the NumPixels storage class is defined as a Python int. Nothing more is required for simple types.

A composite storage class refers to a Python type that can be disassembled into distinct components that can be retrieved independently:

MaskedImage:
  pytype: lsst.afw.image.MaskedImage
  delegate: lsst.something.MaskedImageDelegate
  parameters:
    - subset
  components:
    image: Image
    mask: Mask
  derivedComponents:
    npixels: NumPixels

In this simplified definition for a masked image, there are two components declared along with a derived component that returns the number of pixels in the image. The delegate should be able to disassemble the associated Python type into the image and mask components if the datastore requests disassembly. The delegate would also be used to process the subset read parameter if the formatter used by the datastore has declared it does not support the parameter.

In some cases you may want to define specific storage classes that are specializations of a more generic definition. You can do this using YAML anchors and references but the preferred approach is to use the inheritsFrom key in the storage class definition:

GenericStorageClass:
  pytype: lsst.generic.GenericX
  components:
    image: ImageX
    metadata: Metadata
GenericStorageClassI:
  inheritsFrom: GenericStorageClass
  pytype: lsst.generic.GenericI
  components:
    image: ImageI

If this approach is used the StorageClass Python class created by StorageClassFactory will inherit from the specific parent class and not the generic StorageClass.

Storage Class Delegates

Every StorageClass that defines read parameters or components (read/write or derived) must also specify a storage class delegate class which should inherit from the StorageClassDelegate base class.

Composite Disassembly

A composite is declared by specifying components in the StorageClass definition. Storage class delegate classes must provide at minimum a StorageClassDelegate.getComponent() method to enable a specific component to be extracted from the composite Python type. Datastores can be configured to prefer to write composite datasets out as the individual components and to reconstruct the composite on read. This can lead to more efficient use of datastore bandwidth (especially an issue for an S3-like storage rather than a local file system) if a pipeline always takes as input a component and does not require the full dataset or if a user in the science platform wants to retrieve the metadata for many datasets. To allow this the delegate subclass must provide StorageClassDelegate.assemble() and StorageClassDelegate.disassemble().
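As a rough illustration of the three delegate methods named above, the following toy sketch uses a stand-in dataclass instead of a real image type. The class names are hypothetical and the method signatures are simplified; a real delegate would inherit from StorageClassDelegate and follow its documented interface.

```python
from dataclasses import dataclass


@dataclass
class ToyMaskedImage:
    """Stand-in composite type with two read/write components."""

    image: list
    mask: list


class ToyMaskedImageDelegate:
    """Toy delegate; method names mirror the text, signatures are simplified."""

    components = ("image", "mask")

    def getComponent(self, composite, componentName):
        # Extract a single component from the in-memory composite.
        if componentName not in self.components:
            raise AttributeError(f"Unknown component: {componentName}")
        return getattr(composite, componentName)

    def disassemble(self, composite):
        # Split the composite into a mapping of component name to object.
        return {name: self.getComponent(composite, name) for name in self.components}

    def assemble(self, components):
        # Rebuild the composite from its components.
        return ToyMaskedImage(**components)


delegate = ToyMaskedImageDelegate()
original = ToyMaskedImage(image=[1.0, 2.0, 3.0], mask=[0, 1, 0])
parts = delegate.disassemble(original)
roundtrip = delegate.assemble(parts)
```

In this sketch roundtrip compares equal to original, which is the lossless round trip that disassembly relies on.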

Datastores can be configured to always disassemble composites or never disassemble them. Additionally datastores can choose to only disassemble specific storage classes or dataset types.

Warning

Composite disassembly implicitly assumes that an identical Python object can be created from the disassembled components. If this is not true, the components should be declared derived (see next section) and disassembly will never be attempted.

Derived Components

Just as for components of a composite, if a storage class defines derived components, it must also specify a delegate to support the calculation of that derived component. This should be implemented in the StorageClassDelegate.getComponent() method.

Additionally, if the storage class refers to a composite, the datastore can be configured to disassemble the dataset into discrete artifacts. Since derived components are computed and are not persisted themselves, the datastore needs to be told which component should be used to calculate this derived quantity. To enable this the delegate must implement StorageClassDelegate.selectResponsibleComponent(). This method is given the name of the derived component and a list of all available persisted components and must return one and only one relevant component. The datastore will then make a component request to the Formatter associated with that component.
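A minimal sketch of the selection logic might look like the following. The preference table and the error handling are assumptions made for illustration, and the signature is simplified compared with the real StorageClassDelegate method.

```python
class ToyDelegate:
    """Toy delegate that picks the persisted component used for a derived one."""

    # Hypothetical table: persisted components able to compute each derived
    # component, listed in order of preference.
    _candidates = {"npixels": ("image", "mask")}

    def selectResponsibleComponent(self, derivedComponent, fromComponents):
        available = set(fromComponents)
        for name in self._candidates.get(derivedComponent, ()):
            if name in available:
                # Exactly one component is returned.
                return name
        raise ValueError(
            f"No persisted component can compute derived component {derivedComponent!r}"
        )


delegate = ToyDelegate()
chosen = delegate.selectResponsibleComponent("npixels", ["mask", "image"])
fallback = delegate.selectResponsibleComponent("npixels", ["mask"])
```

The datastore would then ask the formatter for the chosen component to provide the derived quantity.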

Note

All delegates must support read/write components and derived components in the StorageClassDelegate.getComponent() implementation method. As a corollary, all storage classes using components must specify a delegate.

Note

A component returned by selectResponsibleComponent() may require a custom formatter to support the derived component, even if it would not otherwise need one.

Read Parameters

Read parameters adjust what is returned by the Butler.get() call, but whatever modification they make, the Python type returned must match the type associated with the StorageClass of the DatasetType. For example, a read parameter that subsets an image is valid because the type returned would still be an image.

If read parameters are defined then a StorageClassDelegate.handleParameters() method must be defined that understands how to apply these parameters to the Python object and returns a modified copy. This method must be written even if a Formatter is to be used. There are two reasons for this. First, there is no guarantee that a particular formatter implementation will understand the parameter (and no requirement that it does). Second, there is no guarantee that a formatter will be involved in retrieval of the dataset. In-memory datastores never involve a file artifact, so although composite disassembly is never an issue for them, a delegate must at least provide the parameter handler to allow the user to configure such a datastore.
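The following toy sketch shows the general shape of such a parameter handler, using a flat list as the "image" and a (start, stop) pair as the subset. The class name, the understood set, and the rejection of unknown parameters are assumptions for this example, not behavior taken from the real base class.

```python
import copy


class ToyImageDelegate:
    """Toy delegate that applies a 'subset' read parameter to a list."""

    understood = frozenset({"subset"})

    def handleParameters(self, inMemoryDataset, parameters=None):
        if not parameters:
            return inMemoryDataset
        unknown = set(parameters) - self.understood
        if unknown:
            raise KeyError(f"Unsupported read parameters: {sorted(unknown)}")
        # Work on a copy so the caller's object is left untouched.
        result = copy.deepcopy(inMemoryDataset)
        start, stop = parameters["subset"]
        return result[start:stop]


delegate = ToyImageDelegate()
pixels = [10, 20, 30, 40, 50]
subset = delegate.handleParameters(pixels, {"subset": (1, 4)})
```

The returned object has the same Python type as the input, which is the constraint placed on read parameters.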

For derived components parameters are handled by the composite component prior to deriving the derived component. The delegate StorageClassDelegate.handleParameters() method will only be called in this situation if no formatter is used (such as with an in-memory datastore).

Formatters

Formatters are responsible for serializing a Python type to a storage system and for reconstructing the Python type from the serialized form. A formatter has to implement at minimum a Formatter.read() method and a Formatter.write() method. The write() method takes a Python object and serializes it somewhere and the read() method is optionally given a component name and returns the matching Python object. Details of where the artifact may be located within the datastore are passed to the constructor by the datastore as a FileDescriptor instance.

Warning

The formatter system has only been used to write datasets to files or to bytes that would be written to a file. The interface may evolve as other types of datastore become available and make use of the formatter system. The interface is being reassessed on DM-26658.

When ingesting files from external sources, a formatter is associated with each incoming file, but these formatters are only required to support a Formatter.read() method. They must, however, declare all the file extensions that they support. This allows the datastore to check that the file being ingested has not obviously been associated with a formatter that does not recognize it.

In the current implementation, which is focussed entirely on external files in datastores, the location of the serialized data is available to the formatter through the Formatter.fileDescriptor property. This FileDescriptor makes the file location available as a Location, gives access to read parameters supplied by the caller, and defines the StorageClass of the dataset being written. On read, the storage class used to read the file can be different from the storage class expected to be returned by Datastore. This happens if a composite was written but a component from that composite is being read.

File Extensions

Each formatter that reads or writes a file must declare the file extensions that it supports. For a formatter that supports a single extension this is most easily achieved by setting the class property Formatter.extension to that extension. In some scenarios a formatter might support multiple formats that are controlled by write parameters. In this case the formatter should assign a frozen set to the Formatter.supportedExtensions class property. It is then required that the class implements an instance property for extension that returns the extension that will be used by this formatter for writing the current dataset.
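The two ways of declaring extensions can be sketched as follows. These are toy classes that do not inherit from the real Formatter, and the format write parameter and its default are assumptions chosen for illustration.

```python
class ToySingleExtensionFormatter:
    """A formatter with exactly one extension sets a class property."""

    extension = ".fits"


class ToyMultiFormatFormatter:
    """A formatter whose output format depends on a write parameter."""

    supportedExtensions = frozenset({".yaml", ".json"})

    def __init__(self, writeParameters=None):
        self.writeParameters = dict(writeParameters or {})

    @property
    def extension(self):
        # Hypothetical 'format' write parameter selects the extension in use.
        ext = "." + self.writeParameters.get("format", "yaml")
        if ext not in self.supportedExtensions:
            raise ValueError(f"Unsupported extension: {ext}")
        return ext


default_ext = ToyMultiFormatFormatter().extension
json_ext = ToyMultiFormatFormatter({"format": "json"}).extension
```

The instance property reports the extension for the dataset currently being written, while the class-level frozen set lists everything the formatter could read or write.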

File vs Bytes

Some datastores can stream bytes from remote storage systems and do not require that a local file is created before the Python object can be created. To support this use case an implementer can implement Formatter.fromBytes() for reading in from a datastore and Formatter.toBytes() for serializing to a datastore. If a formatter raises NotImplementedError when these byte-like methods are called the datastore will default to using the Formatter.read() and Formatter.write() methods making use of local temporary files.
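The fallback behavior on the write side can be sketched as below. This is a toy model only: the formatter classes, the serialize helper, and the write(obj, path) signature are simplifications invented for this example and do not reflect how a real datastore or Formatter passes file locations.

```python
import json
import os
import tempfile


class ToyFileOnlyFormatter:
    """Toy formatter that can only work through local files."""

    def toBytes(self, inMemoryDataset):
        raise NotImplementedError("This formatter does not support bytes")

    def write(self, inMemoryDataset, path):
        with open(path, "w") as fh:
            json.dump(inMemoryDataset, fh)


class ToyBytesFormatter(ToyFileOnlyFormatter):
    """Toy formatter that can also serialize directly to bytes."""

    def toBytes(self, inMemoryDataset):
        return json.dumps(inMemoryDataset).encode("utf-8")


def serialize(formatter, inMemoryDataset):
    """Return (payload, used_temp_file), preferring the bytes interface."""
    try:
        return formatter.toBytes(inMemoryDataset), False
    except NotImplementedError:
        pass
    # Fall back to writing a local temporary file and reading it back.
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        formatter.write(inMemoryDataset, path)
        with open(path, "rb") as fh:
            return fh.read(), True
    finally:
        os.remove(path)


data = {"a": 1}
payload_bytes, used_file_bytes = serialize(ToyBytesFormatter(), data)
payload_file, used_file_file = serialize(ToyFileOnlyFormatter(), data)
```

Both paths produce the same payload; the only difference is whether a temporary file was needed along the way.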

Warning

This interface has some rough edges since it is not yet possible for the formatter to optionally support bytes directly based on the amount of data involved. Even though bytes may be more efficient for small or medium-sized datasets, in some cases with significant datasets the memory overhead of multiple copies may be excessive and a temporary file would be more prudent. Neither datastore nor the formatter can opt out of using bytes on a per-dataset basis.

FileFormatter Subclass

For many file-based formatter implementations a subclass of Formatter can be used that has a much simplified interface. FileFormatter allows a formatter implementation to be written using two methods: _readFile() takes a path on the local file system and the expected Python type, and _writeFile() takes the in-memory object to be serialized.

Composites are not handled by FileFormatter.
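The division of labor between a file-oriented base class and its two hooks can be sketched with a toy hierarchy. ToyFileFormatter here is a hypothetical stand-in written for this example; the real FileFormatter obtains the file location from its FileDescriptor rather than from constructor arguments like these.

```python
import json
import os
import tempfile


class ToyFileFormatter:
    """Toy base class: read() and write() delegate to two small hooks."""

    def __init__(self, path, pytype):
        self.path = path
        self.pytype = pytype

    def read(self):
        data = self._readFile(self.path, self.pytype)
        if not isinstance(data, self.pytype):
            raise TypeError(f"Expected {self.pytype.__name__}, got {type(data).__name__}")
        return data

    def write(self, inMemoryDataset):
        self._writeFile(inMemoryDataset)
        return self.path

    def _readFile(self, path, pytype=None):
        raise NotImplementedError()

    def _writeFile(self, inMemoryDataset):
        raise NotImplementedError()


class ToyJsonFormatter(ToyFileFormatter):
    """A concrete formatter only has to supply the two hooks."""

    extension = ".json"

    def _readFile(self, path, pytype=None):
        with open(path) as fh:
            return json.load(fh)

    def _writeFile(self, inMemoryDataset):
        with open(self.path, "w") as fh:
            json.dump(inMemoryDataset, fh)


with tempfile.TemporaryDirectory() as tmpdir:
    formatter = ToyJsonFormatter(os.path.join(tmpdir, "dataset.json"), dict)
    formatter.write({"npixels": 12})
    restored = formatter.read()
```

The subclass contains no logic beyond opening the file and converting to and from the Python object, which is the simplification this kind of base class aims for.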

Note

The design of this class hierarchy will be reassessed in DM-26658.

Write Parameters

Datastores can be configured to specify parameters that can control how a formatter serializes a Python object. These configuration parameters are not available to Butler users as part of Butler.put since the user does not know how a datastore is configured or which formatter will be used for a particular DatasetType.

When datastore instantiates the Formatter the relevant write parameters are supplied. These write parameters can be accessed when the data are written and they can control any aspect of the write. The only caveat is that the Formatter.read method must be able to read the resulting file without having to know which write parameters were used to create it. The Formatter.read method can look at the file extension and file metadata but it will not have the write parameters supplied to it by datastore.
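One way to satisfy this constraint is to encode the write choice in the file extension, as in the toy sketch below. The compress write parameter and the two helper functions are hypothetical and exist only to illustrate reading without access to the write parameters.

```python
import gzip
import json
import os
import tempfile


def write_dataset(obj, stem, writeParameters):
    """Write obj as JSON, gzip-compressed if the (hypothetical) parameter says so."""
    compress = writeParameters.get("compress", False)
    path = stem + (".json.gz" if compress else ".json")
    opener = gzip.open if compress else open
    with opener(path, "wt") as fh:
        json.dump(obj, fh)
    return path


def read_dataset(path):
    """Read the dataset back using only the file extension as a hint."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return json.load(fh)


data = {"values": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = write_dataset(data, os.path.join(tmpdir, "a"), {})
    gz_path = write_dataset(data, os.path.join(tmpdir, "b"), {"compress": True})
    from_plain = read_dataset(plain_path)
    from_gz = read_dataset(gz_path)
```

The reader never sees the write parameters, yet both files are recovered correctly.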

Write Recipes

Sometimes you would like a formatter to be configured in the same way for all dataset types that use it but the configuration is very detailed. An example of this is the configuration of data compression parameters for FITS files. Rather than require that every formatter is explicitly configured with this detail, we have the concept of named write recipes. Write recipes have their own configuration section and are associated with a specific formatter class and contain named collections of parameters. The write parameters can then specify one of the named recipes by name.

If write recipes are used the formatter should implement a Formatter.validateWriteRecipes method. This method not only checks that the parameters are reasonable, it can also update the parameters with default values to make them self-consistent.
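A validation method of this kind might look roughly like the sketch below. The recipe keys (algorithm, level), the default values, and the choice of ValueError are all assumptions for illustration; a real formatter defines its own recipe schema.

```python
class ToyFormatter:
    """Toy formatter with a recipe validator that also fills in defaults."""

    # Hypothetical recipe schema and defaults.
    _recipeDefaults = {"algorithm": "NONE", "level": 0}
    _allowedAlgorithms = frozenset({"NONE", "GZIP"})

    @classmethod
    def validateWriteRecipes(cls, recipes):
        if not recipes:
            return recipes
        validated = {}
        for name, params in recipes.items():
            unknown = set(params) - set(cls._recipeDefaults)
            if unknown:
                raise ValueError(f"Recipe {name!r} has unknown keys: {sorted(unknown)}")
            # Merge into a new dict so the input configuration is not modified.
            merged = {**cls._recipeDefaults, **params}
            if merged["algorithm"] not in cls._allowedAlgorithms:
                raise ValueError(f"Recipe {name!r} uses unknown algorithm {merged['algorithm']!r}")
            validated[name] = merged
        return validated


recipes = {"lossless": {"algorithm": "GZIP"}}
validated = ToyFormatter.validateWriteRecipes(recipes)
```

The validated result contains every key the formatter expects, so the writing code does not need to handle missing values.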

Configuring Formatters

Formatter configuration matches on dataset type, storage class, or data ID as described in Name Matching and is present in the formatters section of the datastore YAML configuration. The simplest configuration maps one of these keys to a fully-qualified Python formatter class. For example:

Defects: lsst.obs.base.formatters.fitsGeneric.FitsGenericFormatter
Exposure: lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter

Here we have two storage classes and they each point to a different formatter.

If a particular entry needs write parameters they can be defined by expanding the hierarchy:

Packages:
  formatter: lsst.obs.base.formatters.packages.PackagesFormatter
  parameters:
    format: yaml

Here the Packages storage class is associated with a formatter and the write parameters define one format option.

Sometimes it is required that every usage of a specific formatter should be configured in a uniform way. This can be done using the magic default entry:

default:
  lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter:
    # default is the default recipe regardless but this demonstrates
    # how to specify a default write parameter
    recipe: lossless

Here we are declaring that every write using the FitsExposureFormatter should by default be configured to use the lossless compression write recipe (the recipe parameter here is not special, but is understood by the formatter to mean a key into the write recipes configurations). Parameters associated with a specific entry will be merged with the defaults. This can allow lossless compression by default but allow specific dataset types to use lossy compression.
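The merging of defaults with entry-specific parameters can be sketched with plain dictionaries. The configuration below is a toy: the formatter name, the lossyCoadd key, and the checksum parameter are hypothetical, and the real lookup also involves name matching that is not modeled here.

```python
config = {
    "default": {
        "ToyFitsFormatter": {"recipe": "lossless", "checksum": True},
    },
    # Simple form: key maps directly to a formatter name.
    "Exposure": "ToyFitsFormatter",
    # Expanded form: formatter plus entry-specific write parameters.
    "lossyCoadd": {
        "formatter": "ToyFitsFormatter",
        "parameters": {"recipe": "lossy"},
    },
}


def resolve(config, key):
    """Return (formatter name, merged write parameters) for a config key."""
    entry = config[key]
    if isinstance(entry, str):
        formatter, params = entry, {}
    else:
        formatter, params = entry["formatter"], entry.get("parameters", {})
    defaults = config.get("default", {}).get(formatter, {})
    # Entry-specific parameters take precedence over the formatter defaults.
    return formatter, {**defaults, **params}


exposure = resolve(config, "Exposure")
lossy = resolve(config, "lossyCoadd")
```

The Exposure entry picks up the default recipe unchanged, while lossyCoadd overrides the recipe and still inherits the other default parameter.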

Write recipes also get their own magic key at the top level:

write_recipes:
  lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter:
    recipe1:
      ...
    recipe2:
      ...

The write recipes are also grouped by formatter class, and each ... represents arbitrary YAML configuration associated with the labels recipe1 and recipe2.