.. _daf_butler_storageclass_formatters_delegates:

########################################################
Storage Classes, Storage Class Delegates, and Formatters
########################################################

.. py:currentmodule:: lsst.daf.butler

Formatters and storage class delegates provide the interface between Butler and the python types it is storing and retrieving.
A Formatter is responsible for serializing a Python type to an external storage system and reading that serialized form back into Python.
The serialization can be stored to a local file system or cloud storage or even a database.
It is possible for a formatter to be globally configured to use particular parameters on write.
On retrieval of datasets read parameters can be used that can, for example, return only a subset of the data.

Storage class delegates are used to disassemble and reassemble composite datasets and can also be used to process read parameters that adjust how the retrieved dataset might be modified on get.

Deciding which formatter or delegate to use is controlled by the storage class and corresponding dataset type.

.. note::

  When discussing configuration below, the default configuration values can be inspected at ``$DAF_BUTLER_DIR/python/lsst/daf/butler/configs`` (they can be accessed directly as Python package resources) and current values can be obtained by calling ``butler config-dump`` on a Butler repository.
  For example, this will list the formatter section of a butler configuration (assuming a single file-based datastore is in use):

  .. prompt:: bash

     butler config-dump --subset .datastore.formatters ./repo-dir

.. _daf_butler_storage_classes:

Storage Classes
===============

A Storage Class is fundamental to informing Butler how to deal with specific Python types.
Each `DatasetType` is associated with a `StorageClass` when it is defined (usually as part of a pipeline configuration).
This storage class declares the Python type, any components it may have (derived or read-write), and a delegate that can be used to process read parameters and do assembly.

Composites
^^^^^^^^^^

A composite storage class declares that the Python type consists of discrete components that can be accessed individually.
Each of these components must declare its storage class as well.

For example, if a ``pvi`` dataset type has been associated with an ``ExposureF`` composite storage class, the user can `Butler.get()` the full ``pvi`` and access the components as they would normally for an `~lsst.afw.image.ExposureF`, or if the user solely want the metadata header from the exposure they can ask for ``pvi.metadata`` and just get that.
The implementation details of how that metadata is retrieved depend on the details of how the dataset was serialized within the datastore.

Composites must declare **all** the components for the Python type and Butler requires that if a composite is dissassembled into its components and then reassembled to form the composite again, this operation must be lossless.

Only composites have the potential to be disassembled into discrete file artifacts by datastore.
Disassembly itself is controlled by the datastore configuration.
In cases where the datastore uses a remote storage and only some components are required, then disassembly can significantly improve data access performance.

Derived Components
^^^^^^^^^^^^^^^^^^

There are some situations where a Python type has some property that can usefully be retrieved that looks like a component of a composite but is not a component since it is not an integral part of the composite Python type.
This is particularly true for metadata such as bounding boxes or pixel counts which can efficiently be obtained from a large dataset without requiring that large dataset to be read into memory solely to be discarded once this information is extracted.

Derived components are read-only components that can be defined for all storage classes without that storage class being declared to be a composite.
As for standard components, a derived component declares the storage class (and therefore Python type) of that component.
If your Python type has useful components that can be accessed but which do not support full disassembly (because round-tripping disassembly is lossy), derived components should be defined rather than full-fledged components.

Read Parameters
^^^^^^^^^^^^^^^

A storage class definition includes read parameters that can control how a particular storage class is modified by `Butler.get()` before being returned to the caller.
These read parameters can be thought of as being understood by the Python type being returned and modifying it in a way that still returns that identical Python type.
The canonical example of this is subsetting where the caller passes in a bounding box that reduces the size of the image being returned.

If a parameter would change the Python type its functionality should be encapsulated in a derived component.

.. note::

  Parameters that control how to serialize a dataset into an artifact within the datastore are not supported by user code doing a `Butler.put()`.
  This is because a user storing a dataset does not know which formatter the datastore has been configured to use or even if a dataset will be persisted.
  For example, the caller has no real idea whether a particular compression option is supported or not because they have no visibility into whether the file written is in HDF5 or FITS or plain text.
  For this reason write parameters are part of formatter configuration.

Read Parameters and Derived Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Read parameters are usually applied during the retrieval of the associated Python type from the persisted artifact.
This requires that the parameters are understood by that Python type.
When derived components involve read parameters there are now multiple ways in which the parameters can be applied.

Consider the case of a pixel counting derived component for an image dataset.
A read parameter for subsetting the dataset should be applied to the image before the pixel counting is performed.
It does not make sense for subsetting to be applied to the integer pixel count.

For this reason read parameters for derived components are processed prior to calculating the derived component.

Defining a Storage Class
^^^^^^^^^^^^^^^^^^^^^^^^

Storage classes are defined in the ``storageClasses`` section of the Butler configuration YAML file.
A very simple declaration of a storage class with an associated python type looks like:

.. code-block:: yaml

   NumPixels:
     pytype: int

This declares that the ``NumPixels`` storage class is defined as a Python `int`.
Nothing more is required for simple types.

A composite storage class refers to a Python type that can be disassembled into distinct components that can be retrieved independently:

.. code-block:: yaml

  MaskedImage:
    pytype: lsst.afw.image.MaskedImage
    delegate: lsst.something.MaskedImageDelegate
    parameters:
      - subset
    components:
      image: Image
      mask: Mask
    derivedComponents:
      npixels: NumPixels

In this simplified definition for a masked image, there are two components declared along with a derived component that returns the number of pixels in the image.
The delegate should be able to disassemble the associated Python type into the ``image`` and ``mask`` components if the datastore requests disassembly.
The delegate would also be used to process the ``subset`` read parameter if the formatter used by the datastore has declared it does not support the parameter.

In some cases you may want to define specific storage classes that are specializations of a more generic definition.
You can do this using YAML anchors and references but the preferred approach is to use the ``inheritsFrom`` key in the storage class definition:

.. code-block:: yaml

   GenericStorageClass:
      pytype: lsst.generic.GenericX
      components:
        image: ImageX
        metadata: Metadata
   GenericStorageClassI:
     inheritsFrom: GenericStorageClass
     pytype: lsst.generic.GenericI
     components:
       image: ImageI

If this approach is used the `StorageClass` Python class created by `StorageClassFactory` will inherit from the specific parent class and not the generic `StorageClass`.

Storage Class Delegates
=======================

Every `StorageClass` that defines read parameters or components (read/write or derived) must also specify a storage class delegate class which should inherit from the `StorageClassDelegate` base class.

Composite Disassembly
^^^^^^^^^^^^^^^^^^^^^

A composite is declared by specifying components in the `StorageClass` definition.
Storage class delegate classes must provide at minimum a `StorageClassDelegate.getComponent()` method to enable a specific component to be extracted from the composite Python type.
Datastores can be configured to prefer to write composite datasets out as the individual components and to reconstruct the composite on read.
This can lead to more efficient use of datastore bandwidth (especially an issue for an S3-like storage rather than a local file system) if a pipeline always takes as input a component and does not require the full dataset or if a user in the science platform wants to retrieve the metadata for many datasets.
To allow this the delegate subclass must provide `StorageClassDelegate.assemble()` and `StorageClassDelegate.disassemble()`.

Datastores can be configured to always disassemble composites or never disassemble them.
Additionally datastores can choose to only disassemble specific storage classes or dataset types.

.. warning::

  Composite disassembly implicitly assumes that an identical Python object can be created from the disassembled components.
  If this is not true, the components should be declared derived (see next section) and disassembly will never be attempted.

Derived Components
^^^^^^^^^^^^^^^^^^

Just as for components of a composite, if a storage class defines derived components, it must also specify a delegate to support the calculation of that derived component.
This should be implemented in the `StorageClassDelegate.getComponent()` method.

Additionally, if the storage class refers to a composite, the datastore can be configured to disassemble the dataset into discrete artifacts.
Since derived components are computed and are not persisted themselves, the datastore needs to be told which component should be used to calculate this derived quantity.
To enable this the delegate must implement `StorageClassDelegate.selectResponsibleComponent()`.
This method is given the name of the derived component and a list of all available persisted components and must return one and only one relevant component.
The datastore will then make a component request to the `~lsst.daf.butler.Formatter` associated with that component.

.. note::

  All delegates must support read/write components and derived components in the `StorageClassDelegate.getComponent()` implementation method.
  As a corollary, all storage classes using components must specify a delegate.

.. note::

   A component returned by `~StorageClassDelegate.selectResponsibleComponent()` may require a custom formatter, to support the derived component, even if it otherwise would not.

Read Parameters
^^^^^^^^^^^^^^^

Read parameters are used to adjust what is returned by the `Butler.get()` call but there is a requirement that whatever those read parameters do to modify the `Butler.get()` the Python type returned must match the type associated with the `Butler.StorageClass` associated with the `Butler.DatasetType`.
For example this means that a read parameter that subsets an image is valid because the type returned would still be an image.

If read parameters are defined then a `StorageClassDelegate.handleParameters()` method must be defined that understands how to apply these parameters to the Python object and should return a modified copy.
This method must be written even if a `Formatter` is to be used.
There are two reasons for this; firstly, there is no guarantee that a particular formatter implementation will understand the parameter (and no requirement for that to be the case), and secondly there is no guarantee that a formatter will be involved in retrieval of the dataset.
In-memory datastores never involve a file artifact so whilst composite disassembly is never an issue, a delegate must at least provide the parameter handler to allow the user to configure such a datastore.

For derived components parameters are handled by the composite component prior to deriving the derived component.
The delegate `StorageClassDelegate.handleParameters()` method will only be called in this situation if no formatter is used (such as with an in-memory datastore).

Formatters
==========

Formatters are responsible for serializing a Python type to a storage system and for reconstructing the Python type from the serialized form.
A formatter has to implement at minimum a `Formatter.read()` method and a `Formatter.write()` method.
The ``write()`` method takes a Python object and serializes it somewhere and the ``read()`` method is optionally given a component name and returns the matching Python object.
Details of where the artifact may be located within the datastore are passed to the constructor by the datastore as a `FileDescriptor` instance.

.. warning::

  The formatter system has only been used to write datasets to files or to bytes that would be written to a file.
  The interface may evolve as other types of datastore become available and make use of the formatter system.
  The interface is being reassessed on :jira:`DM-26658`.

When ingesting files from external sources formatters are associated with each incoming file but these formatters are only required to support a `Formatter.read()` method.
They must though declare all the file extensions that they can support.
This allows the datastore to ensure that the image being ingested has not obviously been associated with a formatter that does not recognize it.

In the current implementation that is focussed entirely on external files in datastores, the location of the serialized data is available to the formatter using the `Formatter.fileDescriptor` property.
This `FileDescriptor` property makes the file location available as a `Location` and also gives access to read parameters supplied by the caller and also defines the `StorageClass` of the dataset being written.
On read the the storage class used to read the file can be different from the storage class expected to be returned by `Datastore`.
This happens if a composite was written but a component from that composite is being read.

File Extensions
^^^^^^^^^^^^^^^

Each formatter that reads or writes a file must declare the file extensions that it supports.
For a formatter that supports a single extension this is most easily achieved by setting the class property `Formatter.extension` to that extension.
In some scenarios a formatter might support multiple formats that are controlled by write parameters.
In this case the formatter should assign a frozen set to the `Formatter.supportedExtensions` class property.
It is then required that the class implements an instance property for ``extension`` that returns the extension that will be used by this formatter for writing the current dataset.

File vs Bytes
^^^^^^^^^^^^^

Some datastores can stream bytes from remote storage systems and do not require that a local file is created before the Python object can be created.
To support this use case an implementer can implement `Formatter.fromBytes()` for reading in from a datastore and `Formatter.toBytes()` for serializing to a datastore.
If a formatter raises `NotImplementedError` when these byte-like methods are called the datastore will default to using the `Formatter.read()` and `Formatter.write()` methods making use of local temporary files.

.. warning::

  This interface has some rough edges since it is not yet possible for the formatter to optionally support bytes directly based on the amount of data involved.
  Even though bytes may be more efficient for small or medium-sized datasets, in some cases with significant datasets the memory overhead of multiple copies may be excessive and a temporary file would be more prudent.
  Neither datastore nor the formatter can opt out of using bytes on a per-dataset basis.

FileFormatter Subclass
^^^^^^^^^^^^^^^^^^^^^^

For many file-based formatter implementations a subclass of `Formatter` can be used that has a much simplified interface.
`~formatters.file.FileFormatter` allows a formatter implementation to be written using two methods: `~formatters.file.FileFormatter._readFile()` takes a local path to the file system and the expected Python type, and `~formatters.file.FileFormatter._writeFile()` takes the in-memory object to be serialized.

Composites are not handled by `~formatters.file.FileFormatter`.

.. note::

   The design of this class hierarchy will be reassessed in :jira:`DM-26658`.

Write Parameters
^^^^^^^^^^^^^^^^

Datastores can be configured to specify parameters that can control how a formatter serializes a Python object.
These configuration parameters are not available to `Butler` users as part of `Butler.put` since the user does not know how a datastore is configured or which formatter will be used for a particular `DatasetType`.

When datastore instantiates the `Formatter` the relevant write parameters are supplied.
These write parameters can be accessed when the data are written and they can control any aspect of the write.
The only caveat is that the `Formatter.read` method must be able to read the resulting file without having to know which write parameters were used to create it.
The `Formatter.read` method can look at the file extension and file metadata but it will not have the write parameters supplied to it by datastore.

Write Recipes
^^^^^^^^^^^^^

Sometimes you would like a formatter to be configured in the same way for all dataset types that use it but the configuration is very detailed.
An example of this is the configuration of data compression parameters for FITS files.
Rather than require that every formatter is explicitly configured with this detail, we have the concept of named write recipes.
Write recipes have their own configuration section and are associated with a specific formatter class and contain named collections of parameters.
The write parameters can then specify one of the named recipes by name.

If write recipes are used the formatter should implement a `Formatter.validateWriteRecipes` method.
This method not only checks that the parameters are reasonable, it can also update the parameters with default values to make them self-consistent.

Configuring Formatters
^^^^^^^^^^^^^^^^^^^^^^

Formatter configuration matches on dataset type, storage class, or data ID as described in :ref:`daf_butler-config-lookups` and is present in the ``formatters`` section of the datastore YAML configuration.
The simplest configuration maps one of these keys to a fully-qualified python formatter class.
For example:

.. code-block:: yaml

   Defects: lsst.obs.base.formatters.fitsGeneric.FitsGenericFormatter
   Exposure: lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter

Here we have two storage classes and they each point to a different formatter.

If a particular entry needs write parameters they can be defined by expanding the hierarchy:

.. code-block:: yaml

  Packages:
    formatter: lsst.obs.base.formatters.packages.PackagesFormatter
    parameters:
      format: yaml

Here the ``Packages`` storage class is associated with a formatter and the write parameters define one ``format`` option.

Sometimes it is required that every usage of a specific formatter should be configured in a uniform way.
This can be done using the magic ``default`` entry:

.. code-block:: yaml

  default:
    lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter:
      # default is the default recipe regardless but this demonstrates
      # how to specify a default write parameter
      recipe: lossless

Here we are declaring that every write using the ``FitsExposureFormatter`` should by default be configured to use the ``lossless`` compression write recipe (the ``recipe`` parameter here is not special, but is understood by the formatter to mean a key into the write recipes configurations).
Parameters associated with a specific entry will be merged with the defaults.
This can allow lossless compression by default but allow specific dataset types to use lossy compression.

Write recipes also get their own magic key at the top level:

.. code-block:: yaml

  write_recipes:
    lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter:
      recipe1:
        ...
      recipe2:
        ...

The write recipes are also grouped by formatter class and the ``...`` represent arbitrary yaml configuration associated with label ``recipe1`` and ``recipe2``.