Organizing and identifying datasets

Each dataset in a repository is associated with an opaque unique integer ID, which we currently call its dataset_id, and it’s usually seen in Python code as the value of DatasetRef.id. This is the number used as the primary key in most Registry tables that refer to datasets, and it’s the only way the contents of a Datastore are matched to those in a Registry. With that number, the dataset is fully identified, and anything else about it can be unambiguously looked up. We call a DatasetRef whose id attribute is not None a resolved DatasetRef.

Note

In most data repositories, dataset IDs are 128-bit UUIDs that are guaranteed to be unique across all data repositories, not just within one; if two datasets share the same UUID in different data repositories, they must be identical (this is possible because of the extraordinarily low probability of a collision between two random 128-bit numbers, and our reservation of deterministic UUIDs for very special datasets). As a result, we also frequently refer to the dataset ID as the UUID, especially in contexts where UUIDs are actually needed or can be safely assumed. But 64-bit autoincrement integers are also supported (albeit mostly for legacy reasons), and we continue to use “dataset ID” in most code and documentation to refer to either form.

Most of the time, however, users identify a dataset using a combination of three other attributes:

  • a dataset type;
  • a data ID (also known as data coordinates);
  • a collection.

Most collections are constrained to contain only one dataset with a particular dataset type and data ID, so this combination is usually enough to resolve a dataset (see Collections for exceptions).

A dataset’s type and data ID are intrinsic to it — while there may be many datasets with a particular dataset type and/or data ID, the dataset type and data ID associated with a dataset are set and fixed when it is created. A DatasetRef always has both a dataset type attribute and a data ID, though the latter may be empty. Dataset types are discussed below in Dataset types, while data IDs are one aspect of the larger Dimensions system and are discussed in Data IDs.

In contrast, the relationship between datasets and collections is many-to-many: a collection typically contains many different datasets, and a particular dataset may belong to multiple collections. As a result, is is common to search for datasets in multiple collections (often in a well-defined order), and interfaces that provide that functionality can accept a collection search path in many different forms. Collections are discussed further below in Collections.

Dataset types

The names “dataset” and “dataset type” (which daf_butler inherits from its daf_persistence predecessor) are intended to evoke the relationship between an instance and its class in object-oriented programming, but this is a metaphor, not a relationship that maps to any particular Python objects: we don’t have any Python class that fully represents the dataset concept (DatasetRef is the closest), and the DatasetType class is a regular class, not a metaclass. So a dataset type is represented in Python as a DatasetType instance.

A dataset type defines both the dimensions used in a dataset’s data ID (so all data IDs for a particular dataset type have the same keys, at least when put in standard form) and the storage class that corresponds to its in-memory Python type and maps to the file format (or generalization thereof) used by a Datastore to store it. These are associated with an arbitrary string name.

Beyond that definition, what a dataset type means isn’t really specified by the butler itself, but we expect higher-level code that uses butler to make that clear, and one anticipated case is worth calling out here: a dataset type roughly corresponds to the role its datasets play in a processing pipeline. In other words, a particular pipeline will typically accept particular dataset types as inputs and produce particular dataset types as outputs (and may produce and consume other dataset types as intermediates). And while the exact dataset types used may be configurable, changing a dataset type will generally involve substituting one dataset type for a very similar one (most of the time with the same dimensions and storage class).

Collections

Collections are lightweight groups of datasets defined in the Registry. Groups of self-consistent calibration datasets, the outputs of a processing run, and the set of all raw images for a particular instrument are all examples of collections. Collections are referred to in code simply as str names; various Registry methods can be used to manage them and obtain additional information about them when relevant.

There are multiple types of collections, corresponding to the different values of the CollectionType enum. All collection types are usable in the same way in any context where existing datasets are being queried or retrieved, though the actual searches may be implemented quite differently in terms of database queries. Collection types behave completely differently in terms of how and when datasets can be added to or remove from them.

Run Collections

A dataset is always added to a CollectionType.RUN collection when it is inserted into the Registry, and can never be removed from it without fully removing the dataset from the Registry. There is no other way to add a dataset to a RUN collection. The run collection name must be used in any file path templates used by any Datastore in order to guarantee uniqueness (other collection types are too flexible to guarantee continued uniqueness over the life of the dataset).

The name “run” reflects the fact that we expect most RUN collections to be used to store the outputs of processing runs, but they should also be used in any other context in which their lack of flexibility is acceptable, as they are the most efficient type of collection to store and query.

RUN collections that do represent the outputs of processing runs can be associated with a host name string and a timespan, and are expected to be the way in which some provenance is associated with datasets (e.g. a dataset that contains a list of software versions would have the same RUN as the datasets produced by a processing run that used those versions).

Like most collections, a RUN can contain at most one dataset with a particular dataset type and data ID.

Tagged Collections

CollectionType.TAGGED collections are the most flexible type of collection; datasets can be associated with or disassociated from a TAGGED collection at any time, as long as the usual contraint on a collection having only one dataset with a particular dataset type and data ID is maintained. Membership in a TAGGED collection is implemented in the Registry database as a single row in a many-to-many join table (a “tag”) and is completely decoupled from the actual storage of the dataset.

Tags are thus both extremely lightweight relative to copies or re-ingests of files or other Datastore content, and slightly more expensive to store and possibly query than the RUN or CHAINED collection representations (which have no per-dataset costs). The latter is rarely important, but higher-level code should avoid automatically creating TAGGED collections that may not ever be used.

Calibration Collections

CollectionType.CALIBRATION collections associate each dataset they contain with a temporal validity range. The usual constraint on dataset type and data ID uniqueness is enforced as a function of time, not collection-wide - so for any particular dataset type and data ID combination, the validity range timespans may not overlap (but may be - and usually are - adjacent).

In other respects, CALIBRATION collections closely resemble TAGGED collections: they are also backed by a many-to-many join table (where each row has a timespan as well as a collection identifier and a dataset identifier), and datasets can be associated or disassociated from them similarly freely. We use slightly different nomenclature for these operations, reflecting the high-level actions they represent: certifying a dataset adds it to a CALIBRATION collection with a particular validity range, and decertifying a dataset removes some or all of that validity range.

The same dataset can be present in a CALIBRATION collection multiple times with different validity ranges.

Chained Collections

A CollectionType.CHAINED collection is essentially a multi-collection search path that has been saved in the Registry database and associated with a name of its own. Querying a CHAINED collection simply queries its child collections in order, and a CHAINED collection is always (and only) updated when its child collections are.

CHAINED collections may contain other chained collections, as long as they do not contain cycles, and they can also include restrictions on the dataset types to search for within each child collection (see Collection expressions).

The usual constraint on dataset type and data ID uniqueness within a collection is only lazily enforced for chained collections: operations that query them either deduplicate results themselves or terminate single-dataset searches after the first match in a child collection is found. In some methods, like Registry.queryDatasets, this behavior is optional: passing findFirst=True will enforce the constraint, while findFirst=False will not.