Query

class lsst.daf.butler.registry.queries.Query(*, graph: lsst.daf.butler.core.dimensions._graph.DimensionGraph, whereRegion: Optional[lsst.sphgeom._sphgeom.Region], managers: lsst.daf.butler.registry.queries._structs.RegistryManagers, doomed_by: Iterable[str] = ())

Bases: abc.ABC

An abstract base class for queries that return some combination of DatasetRef and DataCoordinate objects.

Parameters:
graph : DimensionGraph

Object describing the dimensions included in the query.

whereRegion : lsst.sphgeom.Region, optional

Region that all region columns in all returned rows must overlap.

managers : RegistryManagers

A struct containing the registry manager instances used by the query system.

doomed_by : Iterable [ str ], optional

A list of messages (appropriate for e.g. logging or exceptions) that explain why the query is known to return no results even before it is executed. Queries with a non-empty list will never be executed.
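The short-circuit behaviour of doomed_by can be pictured with a toy stand-in. The sketch below does not use the real Query class: ToyQuery, run_sql, and fake_sql are hypothetical names introduced only to illustrate that a non-empty message list prevents execution.

```python
from typing import Callable, Iterable, Iterator, List


class ToyQuery:
    """Hypothetical stand-in for a query; not the lsst.daf.butler class."""

    def __init__(
        self,
        run_sql: Callable[[], Iterable[tuple]],
        doomed_by: Iterable[str] = (),
    ):
        self._run_sql = run_sql
        self.doomed_by: List[str] = list(doomed_by)

    def rows(self) -> Iterator[tuple]:
        # A query already known to be empty is never sent to the database.
        if self.doomed_by:
            return
        yield from self._run_sql()


calls: List[str] = []


def fake_sql() -> List[tuple]:
    # Records that the "database" was actually hit.
    calls.append("executed")
    return [(1,), (2,)]


live = ToyQuery(fake_sql)
doomed = ToyQuery(fake_sql, doomed_by=["Dataset type 'raw' not found in any collection."])

live_rows = list(live.rows())
doomed_rows = list(doomed.rows())
```

Only the first query reaches fake_sql; the doomed one yields nothing and keeps its messages available for later reporting.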

Notes

The Query hierarchy abstracts over the database/SQL representation of a particular set of data IDs or datasets. It is expected to be used as a backend for other objects that provide more natural interfaces for one or both of these, not as part of a public interface to query results.

Attributes Summary

datasetType The DatasetType of datasets returned by this query, or None if there are no dataset results (DatasetType or None).
spatial An iterator over the dimension element columns used in post-query filtering of spatial overlaps (Iterator [ DimensionElement ]).
sql A SQLAlchemy object representing the full query (sqlalchemy.sql.FromClause or None).

Methods Summary

any(db, *, region, execute, exact) Test whether this query returns any results.
count(db, *, region, exact) Count the number of rows this query would return.
explain_no_results(db, *, region) Return human-readable messages that may help explain why the query yields no results.
extractDataId(row, *, graph, records, …) Extract a data ID from a result row.
extractDatasetRef(row, dataId, records, …) Extract a DatasetRef from a result row.
extractDimensionsTuple(row, dimensions) Extract a tuple of data ID values from a result row.
getDatasetColumns() Return the columns for the datasets returned by this query.
getDimensionColumn(name) Return the query column that contains the primary key value for the dimension with the given name.
getRegionColumn(name) Return a region column for one of the dimension elements iterated over by spatial.
isUnique() Return True if this query’s rows are guaranteed to be unique, and False otherwise.
makeBuilder(summary) Return a QueryBuilder that can be used to construct a new Query that is joined to (and hence constrained by) this one.
materialize(db) Execute this query and insert its results into a temporary table.
rows(db, *, region) Execute the query and yield result rows, applying predicate.
subset(*, graph, datasets, unique) Return a new Query whose columns and/or rows are (mostly) a subset of this one’s.

Attributes Documentation

datasetType

The DatasetType of datasets returned by this query, or None if there are no dataset results (DatasetType or None).

spatial

An iterator over the dimension element columns used in post-query filtering of spatial overlaps (Iterator [ DimensionElement ]).

Notes

This property is intended primarily as a hook for subclasses to implement and the ABC to call in order to provide higher-level functionality; code that uses Query objects (but does not implement one) should usually not have to access this property.

sql

A SQLAlchemy object representing the full query (sqlalchemy.sql.FromClause or None).

This is None in the special case where the query has no columns, and only one logical row.

Methods Documentation

any(db: lsst.daf.butler.registry.interfaces._database.Database, *, region: Optional[lsst.sphgeom._sphgeom.Region] = None, execute: bool = True, exact: bool = True) → bool

Test whether this query returns any results.

Parameters:
db : Database

Object managing the database connection.

region : sphgeom.Region, optional

A region that any result-row regions must overlap in order to be yielded. If not provided, this will be self.whereRegion, if that exists.

execute : bool, optional

If True, execute at least a LIMIT 1 query if it cannot be determined prior to execution that the query would return no rows.

exact : bool, optional

If True, run the full query and perform post-query filtering if needed, until at least one result row is found. If False, the returned result does not account for post-query filtering, and hence may be True even when all result rows would be filtered out.

Returns:
any : bool

True if the query would (or might, depending on arguments) yield result rows. False if it definitely would not.
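How execute and exact interact can be sketched with a toy function over in-memory rows. Everything here is hypothetical (toy_any, the row dictionaries, exact_overlap), and the precedence given to execute=False is an assumption of this sketch rather than documented behaviour of the real method.

```python
from typing import Callable, Iterable, Sequence


def toy_any(
    candidate_rows: Sequence[dict],
    overlaps: Callable[[dict], bool],
    doomed_by: Iterable[str] = (),
    *,
    execute: bool = True,
    exact: bool = True,
) -> bool:
    """Toy model; candidate_rows plays the role of the database result."""
    if list(doomed_by):
        # Known to be empty before execution.
        return False
    if not execute:
        # Nothing is run, so the answer is only "it might have rows".
        return True
    if not exact:
        # Equivalent of a LIMIT 1 query: ignores post-query filtering.
        return len(candidate_rows) > 0
    # Exact: apply post-query filtering until one row survives.
    return any(overlaps(row) for row in candidate_rows)


def exact_overlap(row: dict) -> bool:
    return row["overlaps_visit"]


# Two rows pass the conservative database join, but neither truly overlaps.
rows = [
    {"tract": 1, "overlaps_visit": False},
    {"tract": 2, "overlaps_visit": False},
]

not_executed = toy_any(rows, exact_overlap, execute=False, exact=False)
cheap_answer = toy_any(rows, exact_overlap, exact=False)
exact_answer = toy_any(rows, exact_overlap, exact=True)
doomed_answer = toy_any(rows, exact_overlap, doomed_by=["no such collection"])
```

In this toy case the inexact answers are True while the exact one is False, which is the situation the exact parameter description warns about.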

count(db: lsst.daf.butler.registry.interfaces._database.Database, *, region: Optional[lsst.sphgeom._sphgeom.Region] = None, exact: bool = True) → int

Count the number of rows this query would return.

Parameters:
db : Database

Object managing the database connection.

region : sphgeom.Region, optional

A region that any result-row regions must overlap in order to be yielded. If not provided, this will be self.whereRegion, if that exists.

exact : bool, optional

If True, run the full query and perform post-query filtering if needed to account for that filtering in the count. If False, the result may be an upper bound.

Returns:
count : int

The number of rows the query would return, or an upper bound if exact=False.

Notes

This counts the number of rows returned, not the number of unique rows returned, so even with exact=True it may provide only an upper bound on the number of deduplicated result rows.
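The three quantities distinguished above (upper bound, exact row count, deduplicated count) can be illustrated with plain Python data. The tuples below are made-up stand-ins for result rows; the last field marks whether a row would survive post-query filtering.

```python
# (instrument, visit, survives_post_query_filtering)
candidates = [
    ("HSC", 1, True),
    ("HSC", 1, True),   # duplicate of the row above
    ("HSC", 2, False),  # removed by post-query filtering
    ("HSC", 3, True),
]

# Like exact=False: counts every row the database would return.
upper_bound = len(candidates)

# Like exact=True: counts only rows that pass post-query filtering.
exact_count = sum(1 for _, _, keep in candidates if keep)

# Number of distinct results, which count() does not compute.
unique_count = len({(inst, visit) for inst, visit, keep in candidates if keep})
```

Here the counts are 4, 3, and 2 respectively, so even the exact count is only an upper bound on the number of unique rows.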

explain_no_results(db: lsst.daf.butler.registry.interfaces._database.Database, *, region: Optional[lsst.sphgeom._sphgeom.Region] = None) → Iterator[str]

Return human-readable messages that may help explain why the query yields no results.

Parameters:
db : Database

Object managing the database connection.

region : sphgeom.Region, optional

A region that any result-row regions must overlap in order to be yielded. If not provided, this will be self.whereRegion, if that exists.

Returns:
messages : Iterator [ str ]

String messages that describe reasons the query might not yield any results.

Notes

Messages related to post-query filtering are only available if rows, any, or count was already called with the same region (with exact=True for the latter two).

At present, this method only returns messages that are generated while the query is being built or filtered. In the future, it may perform its own new follow-up queries, which users may wish to short-circuit simply by not continuing to iterate over its results.
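Because the return value is an iterator, a caller controls how much work is done by how far it iterates. The sketch below is a hypothetical model of that idea (toy_explain, follow_ups, and the messages are invented for illustration); it does not reflect what the real method currently does internally.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

follow_ups_run: List[str] = []


def toy_explain(
    build_messages: Iterable[str],
    follow_ups: Iterable[Tuple[str, Callable[[], Optional[str]]]],
) -> Iterator[str]:
    # Messages gathered while building or filtering the query are free.
    yield from build_messages
    # Hypothetical follow-up checks only run if the caller keeps iterating.
    for name, check in follow_ups:
        follow_ups_run.append(name)
        message = check()
        if message is not None:
            yield message


messages = toy_explain(
    ["Dataset type 'calexp' is not registered."],
    [("count visits", lambda: "No visits match the constraint.")],
)

first = next(messages)
ran_before_second = list(follow_ups_run)  # snapshot: nothing has run yet
rest = list(messages)
```

Stopping after the first message would leave the follow-up check unexecuted; consuming the rest of the iterator triggers it.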

extractDataId(row: Optional[sqlalchemy.engine.RowProxy], *, graph: Optional[DimensionGraph] = None, records: Optional[Mapping[str, Mapping[tuple, DimensionRecord]]] = None) → DataCoordinate

Extract a data ID from a result row.

Parameters:
row : sqlalchemy.engine.RowProxy or None

A result row from a SQLAlchemy SELECT query, or None to indicate the row from an EmptyQuery.

graph : DimensionGraph, optional

The dimensions the returned data ID should identify. If not provided, this will be all dimensions in QuerySummary.requested.

records : Mapping [ str, Mapping [ tuple, DimensionRecord ] ], optional

Nested mapping containing records to attach to the returned DataCoordinate, for which hasRecords will return True. If provided, outer keys must include all dimension element names in graph, and inner keys should be tuples of dimension primary key values in the same order as element.graph.required. If not provided, DataCoordinate.hasRecords will return False on the returned object.

Returns:
dataId : DataCoordinate

A data ID that identifies all required and implied dimensions. If records is not None, this will have hasRecords() return True.
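The nested shape expected for records is easier to see with a concrete value. The sketch below uses plain dictionaries as stand-ins for DimensionRecord objects, and the required table, find_record helper, and example values are illustrative assumptions rather than output of the real dimension system.

```python
from typing import Any, Dict, Mapping, Tuple

# Assumed required dimensions per element, in order (stand-in for
# element.graph.required).
required: Dict[str, Tuple[str, ...]] = {
    "instrument": ("instrument",),
    "visit": ("instrument", "visit"),
}

# Outer key: dimension element name.
# Inner key: tuple of primary key values, ordered as in `required`.
records: Dict[str, Dict[Tuple[Any, ...], dict]] = {
    "instrument": {("HSC",): {"name": "HSC"}},
    "visit": {("HSC", 903334): {"instrument": "HSC", "id": 903334}},
}


def find_record(
    records: Mapping[str, Mapping[tuple, dict]],
    element: str,
    data_id: Mapping[str, Any],
) -> dict:
    # Build the inner key from the data ID in the element's required order.
    key = tuple(data_id[name] for name in required[element])
    return records[element][key]


data_id = {"instrument": "HSC", "visit": 903334}
visit_record = find_record(records, "visit", data_id)
```

The inner key is always a tuple, even for an element with a single required dimension such as instrument.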

extractDatasetRef(row: sqlalchemy.engine.RowProxy, dataId: Optional[DataCoordinate] = None, records: Optional[Mapping[str, Mapping[tuple, DimensionRecord]]] = None) → DatasetRef

Extract a DatasetRef from a result row.

Parameters:
row : sqlalchemy.engine.RowProxy

A result row from a SQLAlchemy SELECT query.

dataId : DataCoordinate, optional

Data ID to attach to the DatasetRef. A minimal (i.e. base class) DataCoordinate is constructed from row if None.

records : Mapping [ str, Mapping [ tuple, DimensionRecord ] ], optional

Records to use to return an ExpandedDataCoordinate. If provided, outer keys must include all dimension element names in graph, and inner keys should be tuples of dimension primary key values in the same order as element.graph.required.

Returns:
ref : DatasetRef

Reference to the dataset; guaranteed to have DatasetRef.id not None.

extractDimensionsTuple(row: Optional[sqlalchemy.engine.RowProxy], dimensions: Iterable[Dimension]) → tuple

Extract a tuple of data ID values from a result row.

Parameters:
row : sqlalchemy.engine.RowProxy or None

A result row from a SQLAlchemy SELECT query, or None to indicate the row from an EmptyQuery.

dimensions : Iterable [ Dimension ]

The dimensions to include in the returned tuple, in order.

Returns:
values : tuple

A tuple of dimension primary key values.

getDatasetColumns() → Optional[lsst.daf.butler.registry.queries._structs.DatasetQueryColumns]

Return the columns for the datasets returned by this query.

Returns:
columns : DatasetQueryColumns or None

Struct containing SQLAlchemy representations of the result columns for a dataset.

Notes

This method is intended primarily as a hook for subclasses to implement and the ABC to call in order to provide higher-level functionality; code that uses Query objects (but does not implement one) should usually not have to call this method.

getDimensionColumn(name: str) → sqlalchemy.sql.elements.ColumnElement

Return the query column that contains the primary key value for the dimension with the given name.

Parameters:
name : str

Name of the dimension.

Returns:
column : sqlalchemy.sql.ColumnElement

SQLAlchemy object representing a column in the query.

Notes

This method is intended primarily as a hook for subclasses to implement and the ABC to call in order to provide higher-level functionality; code that uses Query objects (but does not implement one) should usually not have to call this method.

getRegionColumn(name: str) → sqlalchemy.sql.elements.ColumnElement

Return a region column for one of the dimension elements iterated over by spatial.

Parameters:
name : str

Name of the element.

Returns:
column : sqlalchemy.sql.ColumnElement

SQLAlchemy object representing a result column in the query.

Notes

This method is intended primarily as a hook for subclasses to implement and the ABC to call in order to provide higher-level functionality; code that uses Query objects (but does not implement one) should usually not have to call this method.

isUnique() → bool

Return True if this query’s rows are guaranteed to be unique, and False otherwise.

If this query has dataset results (datasetType is not None), uniqueness applies to the DatasetRef instances returned by extractDatasetRef from the result of rows. If it does not have dataset results, uniqueness applies to the DataCoordinate instances returned by extractDataId.

makeBuilder(summary: Optional[QuerySummary] = None) → QueryBuilder

Return a QueryBuilder that can be used to construct a new Query that is joined to (and hence constrained by) this one.

Parameters:
summary : QuerySummary, optional

A QuerySummary instance that specifies the dimensions and any additional constraints to include in the new query being constructed, or None to use the dimensions of self with no additional constraints.

materialize(db: lsst.daf.butler.registry.interfaces._database.Database) → Iterator[lsst.daf.butler.registry.queries._query.Query]

Execute this query and insert its results into a temporary table.

Parameters:
db : Database

Database engine to execute the query against.

Returns:
context : typing.ContextManager [ MaterializedQuery ]

A context manager that ensures the temporary table is created and populated in __enter__ (returning a MaterializedQuery object backed by that table), and dropped in __exit__. If self is already a MaterializedQuery, __enter__ may just return self and __exit__ may do nothing (reflecting the fact that an outer context manager should already take care of everything else).
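The create-on-enter, drop-on-exit pattern described above can be sketched with the standard library alone. This example uses sqlite3 directly instead of the Database interface, and toy_materialize, the visit table, and the tmp_query table name are all hypothetical.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def toy_materialize(
    conn: sqlite3.Connection, select_sql: str, table: str = "tmp_query"
) -> Iterator[str]:
    # Equivalent of __enter__: create and populate the temporary table.
    conn.execute(f"CREATE TEMPORARY TABLE {table} AS {select_sql}")
    try:
        yield table
    finally:
        # Equivalent of __exit__: drop the table even if the body raised.
        conn.execute(f"DROP TABLE {table}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (instrument TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO visit VALUES (?, ?)",
    [("HSC", 1), ("HSC", 2), ("LATISS", 3)],
)

with toy_materialize(conn, "SELECT id FROM visit WHERE instrument = 'HSC'") as name:
    # Inside the block, later queries can read or join against the table.
    inside = conn.execute(f"SELECT id FROM {name} ORDER BY id").fetchall()

# After the block, the temporary table no longer exists.
remaining = conn.execute(
    "SELECT name FROM sqlite_temp_master WHERE type = 'table' AND name = 'tmp_query'"
).fetchall()
```

The real method yields a query object backed by the table rather than a table name; the string is used here only to keep the sketch short.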

rows(db: Database, *, region: Optional[Region] = None) → Iterator[Optional[sqlalchemy.engine.RowProxy]]

Execute the query and yield result rows, applying predicate.

Parameters:
db : Database

Object managing the database connection.

region : sphgeom.Region, optional

A region that any result-row regions must overlap in order to be yielded. If not provided, this will be self.whereRegion, if that exists.

Yields:
row : sqlalchemy.engine.RowProxy or None

Result row from the query. None may be yielded exactly once instead of any real rows to indicate an empty query (see EmptyQuery).
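Code that consumes rows needs to tolerate the single None row. The sketch below models that convention with hypothetical helpers (toy_rows, toy_extract) and dictionaries standing in for result rows; treating None as an empty data ID is an assumption of this sketch.

```python
from typing import Dict, Iterator, List, Optional, Sequence

Row = Dict[str, object]


def toy_rows(sql_rows: Optional[List[Row]]) -> Iterator[Optional[Row]]:
    # None as input models a query with no columns and one logical row.
    if sql_rows is None:
        yield None
        return
    yield from sql_rows


def toy_extract(row: Optional[Row], names: Sequence[str]) -> Row:
    # A None row carries no values, so it maps to an empty data ID here.
    if row is None:
        return {}
    return {name: row[name] for name in names}


empty_ids = [toy_extract(row, ()) for row in toy_rows(None)]
real_ids = [
    toy_extract(row, ("instrument",))
    for row in toy_rows([{"instrument": "HSC", "extra": 1}])
]
```

The empty query still produces exactly one result, which is why None is yielded once rather than the iterator being empty.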

subset(*, graph: Optional[lsst.daf.butler.core.dimensions._graph.DimensionGraph] = None, datasets: bool = True, unique: bool = False) → lsst.daf.butler.registry.queries._query.Query

Return a new Query whose columns and/or rows are (mostly) a subset of this one’s.

Parameters:
graph : DimensionGraph, optional

Dimensions to include in the new Query being constructed. If None (default), self.graph is used.

datasets : bool, optional

Whether the new Query should include dataset results. Defaults to True, but is ignored if self does not include dataset results.

unique : bool, optional

Whether the new Query should guarantee unique results (this may come with a performance penalty).

Returns:
query : Query

A query object corresponding to the given inputs. May be self if no changes were requested.

Notes

The way spatial overlaps are handled at present makes it impossible to fully guarantee in general that the new query’s rows are a subset of this one’s while also returning unique rows. That’s because the database is only capable of performing approximate, conservative overlaps via the common skypix system; we defer actual region overlap operations to per-result-row Python logic. But including the region columns necessary to do that postprocessing in the query makes it impossible to do a SELECT DISTINCT on the user-visible dimensions of the query.

For example, consider starting with a query with dimensions (instrument, skymap, visit, tract). That involves a spatial join between visit and tract, and we include the region columns from both tables in the results in order to only actually yield result rows (see predicate and rows) where the regions in those two columns overlap. If the user then wants to subset to just (skymap, tract) with unique results, we have two unpalatable options:

  • we can do a SELECT DISTINCT with just the skymap and tract columns in the SELECT clause, dropping all detailed overlap information and including some tracts that did not actually overlap any of the visits in the original query (but were regarded as _possibly_ overlapping via the coarser, common-skypix relationships);
  • we can include the tract and visit region columns in the query, and continue to filter out the non-overlapping pairs, but completely disregard the user’s request for unique tracts.

This interface specifies that implementations must do the former, as that’s what makes things efficient in our most important use case (QuantumGraph generation in pipe_base). We may be able to improve this situation in the future by putting exact overlap information in the database, either by using built-in (but engine-specific) spatial database functionality or (more likely) switching to a scheme in which pairwise dimension spatial relationships are explicitly precomputed (for e.g. combinations of instruments and skymaps).
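The two options listed above can be compared on a small made-up result set. The rows below are hypothetical; exact_overlap stands for the outcome of the per-row Python region test, which the database itself cannot evaluate.

```python
# Rows that pass the conservative, skypix-based join in the database.
candidate_rows = [
    {"skymap": "rings", "tract": 10, "visit": 1, "exact_overlap": True},
    {"skymap": "rings", "tract": 10, "visit": 2, "exact_overlap": True},
    {"skymap": "rings", "tract": 11, "visit": 2, "exact_overlap": False},
]

# First option: SELECT DISTINCT on (skymap, tract) only. Results are unique,
# but tract 11 is included although it overlaps no visit exactly.
option_distinct = sorted({(r["skymap"], r["tract"]) for r in candidate_rows})

# Second option: keep the region columns and filter exactly. Tract 11 is
# dropped, but tract 10 appears once per overlapping visit.
option_filtered = [
    (r["skymap"], r["tract"]) for r in candidate_rows if r["exact_overlap"]
]
```

The first result is unique but not a strict subset of the exactly filtered rows, while the second is a subset but not unique; the interface requires the first behaviour.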