lsst.daf.relation¶
Overview¶
The Relation class represents the core concept of relational algebra: a table with a well-defined set of columns and unique rows.
A Relation instance does not necessarily correspond to a concrete in-memory or on-disk table, however; most derived Relation types actually represent an operation on some other “target” relation or relations, forming an expression tree.
Operations on relations are represented by subclasses of UnaryOperation and BinaryOperation, which are associated with a target relation to form a new relation via the UnaryOperationRelation and BinaryOperationRelation classes.
The LeafRelation class handles relations that represent direct storage of rows (and in some cases actually do store rows themselves).
The Relation interface provides factory methods for constructing and applying operations to relation that should be used instead of directly interacting with operation classes when possible.
The fourth and final Relation class is MarkerRelation, which adds some information or context to a target relation without changing the relation’s
actual content.
Unlike other relation types, MarkerRelation can be inherited from freely.
Concrete Relation classes (including extensions) should be frozen dataclasses, ensuring that they are immutable (with the exception of Relation.payload), equality comparable, hashable, and that they have a useful (but in still lossy) repr.
Relation classes also provide a str representation that is much more concise, for cases where seeing the overall tree is more important than the details of any particular relation.
Engines¶
Relations are associated with “engines”: systems that may hold the actual data a relation (at least) conceptually represents and can perform operations on them to obtain the derived data.
These are represented by Engine instances held by the relation objects themselves, and the sql and iteration subpackages provide at least partial implementations of engines for relations backed by SQL databases (via SQLAlchemy) and native Python iterables, respectively.
A relation tree can include multiple engines, by using a Transfer relation class to mark the boundaries between them.
The Processor class can be used to execute multiple-engine trees.
It is up to an engine how strictly its operations adhere to relational algebra operation definition.
SQL is formally defined in terms of operations on “bags” or “multisets”, whose rows are not unique and sometimes ordered, while formal relations are always unordered and unique.
The Relation interface has more a more permissive view of uniqueness to facilitate interaction with SQL: a Relation may have non-unique rows, but any duplicates are not meaningful, and hence most operations may remove or propagate duplicates at their discretion, though engines may make stronger guarantees and most relations are not permitted to introduce duplication when applied to a base relation with unique rows.
It is also up to engines to determine whether an operations maintains the order of rows; SQL engine operations often do not, while the iteration engine’s operations always do.
LeafRelation and MarkerRelation objects can have an engine-specific payload attribute that either holds the actual relation state for that engine or a reference to it.
The iteration engine’s payloads are instances of the iteration.RowIterable interface, while the SQL engine’s payloads are sql.Payload instances, which can represent a SQLAlchemy table or subquery expression.
LeafRelation objects always have a payload that is not None.
Materialization markers indicate that a payload should be attached when the relation is first “executed” by an engine or the Processor class, allowing subsequent executions to reuse that payload and avoid repeating upstream execution.
Attaching a payload is the only way a relation can be modified after construction, and a payload that is not None can never be replaced.
Operations¶
The full set of unary and binary operations provided is given below, along with the Relation factory methods that can be used to apply certain operations directly.
Applying an operation to a relation always returns a new relation (unless the operation is a no-op, in which case the original may be returned unchanged), and always acts lazily: applying an operation is not the same as processing or executing a relation tree that contains that operation.
Calculation(UnaryOperation) /Relation.with_calculated_columnAdd a new column whose values are calculated from one or more existing columns, via by a
column expression.Chain(BinaryOperation) /Relation.chainConcatenate the rows of two relations that have the same columns. This is equivalent to
UNION ALLin SQL oritertools.chainin Python.Deduplication(UnaryOperation) /Relation.without_duplicatesRemove duplicate rows. This is equivalent to
SELECT DISTINCTin SQL or filtering throughsetordictin Python.Identity(UnaryOperation)Do nothing. This operation never actually appears in
Relationtrees;Identity.applyalways returns the operand relation passed to it.Join(BinaryOperation) /Relation.joinPerform a natural join: combine two relations by matching rows with the same values in their common columns (and satisfying an optional column expression, via a
Predicate), producing a new relation whose columns are the union of the columns of its operands. This is equivalent to [INNER]JOINin SQL.Projection(UnaryOperation) /Relation.with_only_columnsRemove some columns from a relation.
Reordering(UnaryOperation)An intermediate abstract base class for unary operations that only change the order of rows.
RowFilter(UnaryOperation)An intermediate abstract base class for unary operations that only remove rows.
Selection(RowFilter) /Relation.with_rows_satisfyingRemove rows that satisfy a
boolean column expression. This is equivalent to theWHEREclause in SQL.Slice(RowFilter) /Relation.__getitem__Remove rows outside an integer range of indices. This is equivalent to
OFFSETandLIMITin SQL, or indexing withsliceobject orstart:stopsyntax in Python.Sort(Reordering) /Relation.sortedSort rows according to a
column expression.
Column Expressions¶
Many relation operations involve column expressions, such as the boolean filter used in a Selection or the sort keys used in a Sort.
These are represented by the ColumnExpression (for general scalar-valued expressions), Predicate (for boolean expressions), and ColumnContainer (for expressions that hold multiple values) abstract base classes.
These base classes provide factory methods for all derived classes, making it generally unnecessary to refer to those types directly (except when writing an algorithm or engine that processes a relation tree; see Extensibility).
Column expression objects can in general support multiple engines; some types are required to be supported by all engines, while others can hold a list of engine types that support them.
The concrete column expression classes provided by lsst.daf.relation are given below, with their factory methods:
ColumnLiteral/ColumnExpression.literalA constant scalar, non-boolean Python value.
ColumnReference/ColumnExpression.referenceA reference to a scalar, non-boolean column in a relation.
ColumnFunction` /ColumnExpression.function,ColumnExpression.methodA named function that returns a scalar, non-boolean value given scalar, non-boolean arguments. It is up to each
Enginehow and whether it supports aColumnFunction; this could include looking up the name in some module or treating it as a method that should be present on some object that represents a column value more directly in that engine.ColumnExpressionSequence/ColumnContainer.sequenceA sequence of one or more
ColumnExpressionobjects, representing the same column type (but not neecessarily the same expression type.ColumnRangeLiteral/ColumnContainer.range_literalA virtual sequence of literal integer values represented by a Python
rangeobject.PredicateLiteral/Predicate.literalPredicateReference/Predicate.referenceA reference to a boolean-valued column in a relation.
PredicateFunction/ColumnExpression.predicate_function,ColumnExpression.predicate_methodA named function that returns a boolean, given scalar, non-boolean arguments. Like
ColumnFunction, implementation and support are engine-specific.ColumnExpressionalso haseq,ne,lt,le,gt, andgemethods for the common case ofPredicateFunctionobjects that represent comparison operators.LogicalNot/Predicate.logical_notBoolean expression that is
Trueif its (single, boolean) argument isFalse, and vice versa.LogicalAnd/Predicate.logical_andBoolean expression that is
Trueonly if all (boolean) arguments areTrue.LogicalOr/Predicate.logical_orBoolean expression that is
Trueif any (boolean) argument isTrue.ColumnInContainer/ColumnContainer.containsBoolean expression that tests whether a scalar
ColumnExpressionis included in aColumnContainer.
Extensibility¶
Ideally, this library would be extensible in three different ways:
external
Relationor operation types could be defined, representing new kinds of nodes in a relation expression tree;external
Enginetypes could be defined, representing different ways of storing and manipulating tabular data;external algorithms on relation trees could be defined.
Unfortunately, any custom Engine or relation-tree algorithm in practice needs to be able to enumerate all possible Relation and operation types (or at least reject any trees with Relation or operation types it does not recognize).
Similarly, any custom Relation or operation type would need to be able to enumerate the algorithms and engines it supports, in order to provide its own implementations for those algorithms and engines.
This is a fragile multiple-dispatch problem.
To simplify things, this package chooses to prohibit most kinds of Relation and operation extensibility:
custom
Relationsubclasses must beMarkerRelationsubclasses;custom
BinaryOperationand column expression subclasses are not permitted;custom subclasses of
UnaryOperationare restricted to subclasses of the more limitedRowFilterandReorderingintermediate interfaces.
These prohibitions are enforced by __init_subclass__ checks in the abstract base classes.
Note
Relation is actually a typing.Protocol, not (just) an ABC, and the concrete LeafRelation, UnaryOperationRelation, BinaryOperationRelation, and MarkerRelation classes actually inherit from BaseRelation while satisfying the Relation interface only in a structural subtyping sense.
This allows various Relation attribute interfaces (e.g. Relation.engine) to be implemented as either true properties or dataclass fields, and it should be invisible to users except in the rare case that they need to perform a runtime isinstance check with the Relation type itself, not just a specific concrete Relation subclass: in this case BaseRelation must be used instead of Relation.
The standard approach to designs like this in object oriented programming is the Visitor Pattern, which in this case would involve a base class or suite of base classes for algorithms or engines with a method for each possible relation-tree node type (relations, operations, column expressions); these would be invoked by a method on each node interface whose concrete implementations call the corresponding algorithm or engine method. This implicitly restricts the set of node tree types to those with methods in the algorithm/engine base class.
In languages with functional programming aspects, a much more direct approach involving enumerated types and pattern-matching syntax is often possible.
With the introduction of the match statement in Python 3.10, this now includes Python, and this is the approach taken in here.
This results in much more readable and concise code than the boilerplate-heavy visitor-pattern alternative, but it comes with a significant drawback: there are no checks (either at runtime or via static type checkers like MyPy) that all necessary case branches of a match are present.
This is in part by design - many algorithms on relation trees can act generically on most operation types, and hence need to only explicitly match one or two - but it requires considerable discipline from algorithm and engine authors to ensure that match logic is correct and well-tested.
Another (smaller) drawback is that it can occasionally yield code that in other contexts might be considered antipatterns (e.g. isinstance is often a good alternative to a single-branch match).
Provided Engines¶
Changes¶
Contributing¶
lsst.daf.relation is developed at https://github.com/lsst-dm/daf_relation.
You can find Jira issues for this module under the daf_relation component.
Python API reference¶
lsst.daf.relation Package¶
Functions¶
|
Flatten all logical AND operations in predicate into a |
Classes¶
An implementation-focused target class for concrete |
|
An abstract base class for operations that act on a pair of relations. |
|
|
A concrete |
|
A relation operation that adds a new column from an expression involving existing columns. |
|
A relation operation that concatenates the rows of a pair of relations with the same columns. |
A abstract base class and factory for expressions that represent containers of multiple column values. |
|
Exception type raised to indicate problems in a relation's columns and/or unique keys. |
|
An abstract base class and factory for scalar, non-boolean column expressions. |
|
|
A container expression backed by a sequence of scalar column expressions. |
|
A concrete column expression that represents calling a named function with column expression arguments. |
|
A boolean column expression that tests whether a scalar column expression is present in a container expression. |
|
A concrete column expression backed by a regular Python value. |
|
A container expression backed by a range of integer indices. |
|
A concrete column expression that refers to a relation column. |
|
An interface for objects that represent columns in a relation. |
A relation operation that removes duplicate rows. |
|
|
A relation-processing algorithm that attempts to explain why a relation has no rows. |
|
An abstract interface for the systems that hold relation data and know how to process relation trees. |
Exception type raised to indicate that a relation tree is incompatible with an engine, or has inconsistent engines. |
|
|
An implementation-focused base class for |
|
A concrete unary operation that does nothing. |
|
A binary operation that passes through one of its operands and ignores the other. |
|
A natural join operation. |
|
A |
|
A concrete boolean column expression that ANDs its operands. |
|
A concrete boolean column expression that inverts its operand. |
|
A concrete boolean column expression that ORs its operands. |
|
An extensible relation base class that provides additional information about another relation without changing its row-and-column content. |
|
A marker operation that indicates that the upstream tree should be evaluated only once, with the results saved and reused for subsequent processing. |
|
A |
An abstract base class and factory for boolean column expressions. |
|
|
A concrete boolean expression that represents calling an named function with column expression arguments. |
|
A concrete boolean column expression that is a constant |
|
A concrete boolean column expression that refers to a boolean relation column. |
An inheritable framework for processing multi-engine relation trees. |
|
|
A unary operation that removes one or more columns. |
|
An abstract interface for expression trees on tabular data. |
Exception type raised to indicate problems in the definition of a relation tree. |
|
An extensible |
|
An extensible |
|
|
A relation operation that filters rows according to a boolean column expression. |
|
A relation relation that filters rows that are outside a range of positional indices. |
|
A relation operation that orders rows according to a sequence of column expressions. |
|
Sort expression and indication of sort direction. |
|
A |
|
A struct for the return value of |
An abstract base class for operations that act on a single relation. |
|
|
A concrete |