Overview¶
Dimensions are astronomical concepts that are used to label and organize datasets.
In the Registry
database, most dimensions are associated with a table that contains not just the primary key values for dimensions, but foreign key relationships to other dimension tables, metadata columns, and in some cases spatial regions and time intervals as well.
Examples of dimensions include instruments, detectors, visits, and tracts.
Instances of the Dimension
class represent one of these concepts, not values of the type of one of those concepts (e.g. “detector”, not a particular detector).
In fact, a dimension “value” can mean different things in different contexts: it could mean the value of the primary key or other unique identifier for a particular entity (the integer ID or string name for a particular detector), or it could represent a complete record in the table for that dimension.
The dimensions schema also has some tables that do not map directly to Dimension
instances.
Some of these provide extra metadata fields for combinations of dimensions, and are represented by the DimensionElement
class in Python (this is also the base class of the Dimension
class, and provides much of its functionality).
Others represent the overlaps between spatial dimensions, and are discussed in Spatial and Temporal Dimensions.
Dimension Relationships and Containers¶
Dimensions may have relationships, and in fact these relationships are used almost exclusively to define relationships between datasets, which have no direct relationships between them. There are two kinds of relationships:
- If dimension
A
is a required dependency of dimensionB
,A
’s primary key value must be provided in order to uniquely identifyB
, andA
has a one-to-many relationship withB
. For example, the detector dimension has a required dependency on the instrument dimension, and hence one uses both an instrument name and a detector ID (or detector name) to fully identify a detector. When both dimensions are associated with database tables, a required dependency involves having the dependency’s primary key fields both foreign key fields and part of a compound primary key in the dependent table.- If dimension
C
is an implied dependency of dimensionD
, a value forD
implies a value forC
, andC
has a one-to-many relationship withD
. For example, the visit dimension has an implied dependency on the physical filter dimension, because a visit is observed through exactly one filter and hence each visit ID determines a filter name. When both dimensions are associated with database tables, an implied dependency involves having a foreign key field in the dependent table that is not part of a primary key in the dependent table.
A DimensionGraph
is an immutable, set-like container of dimensions that is guaranteed to (recursively) include all dependencies of any dimension in the graph.
It also categorizes those dimensions into required
and implied
subsets, which have roughly the same meaning for a set of dimensions as they do for a single dimension: once the primary key values of all of the required dimensions are known, the primary key values of all implied dimensions are known as well.
DimensionGraph
also guarantees a deterministic and topological sort order for its elements.
Because Dimension
instances have a name
attribute, we typically
use NamedValueSet
and NamedKeyDict
as containers when immutability is needed or the guarantees of DimensionGraph
.
This allows the string names of dimensions to be used as well in most places where Dimension
instances are expected.
The complete set of all compatible dimensions is held by a special subclass of DimensionGraph
, DimensionUniverse
.
A dimension universe is constructed from configuration, and is responsible for constructing all Dimension
and DimensionElement
instances; within a universe, there is exactly one Dimension
instance that is always used to represent a particular dimension.
DimensionUniverse
instances themselves are held in a global map keyed by the version number in the configuration used for construction, so they behave somewhat like singletons.
Data IDs¶
The most common way butler users encounter dimensions is as the keys in a data ID, a dictionary that maps dimensions to their primary key values.
Different datasets with the same DatasetType
are always identified by the same set of dimensions (i.e. the same set of data ID keys), and hence a DatasetType
instance holds a DimensionGraph
that contains exactly those keys.
Many data IDs are simply Python dictionaries that use the string names of dimensions or actual Dimension
instances as keys.
Most Butler
and Registry
APIs that accept data IDs as input accept both dictionaries and keyword arguments that are added to these dictionaries automatically.
The data IDs returned by the Butler
or Registry
(and most of those used internally) are usually instances of the DataCoordinate
class.
DataCoordinate
instances can have different states of knowledge about the dimensions they identify.
They always contain at least the key-value pairs that correspond to its DimensionGraph
’s required
subset – that is, the minimal set of keys needed to fully identify all other dimensions in the graph.
They can also contain key-value pairs for the implied
subset (a state indicated by DataCoordinate.hasFull()
returning True
).
And if DataCoordinate.hasRecords
returns True
, the data ID also holds all of the metadata records associated with its dimensions.
DataCoordinate
objects can of course be used with standard Python built-in containers, but an interface (DataCoordinateIterable
) and a few simple adapters (DataCoordinateSet
, DataCoordinateSequence
) also exist to provide a bit more functionality for homogenous collections of data IDs (in which all data IDs identify the same dimensions, and generally have the same hasFull
/ hasRecords
state).
Spatial and Temporal Dimensions¶
Dimensions can be spatial or temporal (or both, or neither), meaning that each record is associated with a region on the sky or a timespan (respectively).
The overlaps between regions and timespans define many-to-many relationships between dimensions that — along with the one-to-many ID-based dependencies — generally provide a way to fully relate any set of dimensions.
This produces a natural, concise query system; dimension relationships can be used to construct the full JOIN
clause of a SQL SELECT
with no input from the user, allowing them to specify just the WHERE
clause (see Registry.queryDimensions
and Registry.queryDatasets
).
It is also possible to associate a region or timespan with a combination of dimensions (such as the region for a visit and a detector), by defining a DimensionElement
for that combination.
One kind of spatial dimension is special: a SkyPixDimension
represents a complete pixelization of the sky, as defined by an lsst.sphgeom.Pixelization
object.
These are typically hierarchical pixelizations, like Hierarchical Triangular Mesh (HTM)
), Q3C
, or HEALPix (which has no lsst.sphgeom
implementation currently), but a skypix dimension encodes both the pixelization scheme and a level, defining a unique mapping from points on the sky to integer IDs.
By convention, the name of a skypix dimension starts with a short, lowercase name for the pixelization scheme followed by the integer level (e.g. “htm7”).
A moderately efficient database representation of temporal relationships is straightforward: these are overlaps of 1-d intervals, so we can use regular (i.e. B-tree) indexes to join directly on overlap expressions of intervals expressed as pairs of columns (though more specialized indexing that reflects the non-overlapping nature of many of these intervals may be necessary in the future).
The same is not true of regions (especially regions on the sphere), at least not without assuming a particular RDBMS.
Instead, spatial regions for dimensions are stored as opaque, base64
-encoded strings in the database, but we also create an overlap table for each spatial dimension element that relates it to a special “common” skypix dimension (see DimensionUniverse.commonSkyPix
).
We can then use a regular index on the common skypix ID to make spatial joins efficient, to the extent that proximity in skypix ID corresponds to proximity on sky.
In practice, these IDs correspond to some space-filling curve, which yields good typical-case performance with a reasonable choice of pixelization level, but no guarantees on worst-case performance.