Relation class represents the core concept of relational algebra: a table with a well-defined set of columns and unique rows.
Relation instance does not necessarily correspond to a concrete in-memory or on-disk table, however; most derived Relation types actually represent an operation on some other “target” relation or relations, forming an expression tree.
Operations on relations are represented by subclasses of
BinaryOperation, which are associated with a target relation to form a new relation via the
LeafRelation class handles relations that represent direct storage of rows (and in some cases actually do store rows themselves).
Relation interface provides factory methods for constructing and applying operations to relation that should be used instead of directly interacting with operation classes when possible.
The fourth and final
Relation class is
MarkerRelation, which adds some information or context to a target relation without changing the relation’s
Unlike other relation types,
MarkerRelation can be inherited from freely.
Relation classes (including extensions) should be frozen dataclasses, ensuring that they are immutable (with the exception of
Relation.payload), equality comparable, hashable, and that they have a useful (but in still lossy)
Relation classes also provide a
str representation that is much more concise, for cases where seeing the overall tree is more important than the details of any particular relation.
Relations are associated with “engines”: systems that may hold the actual data a relation (at least) conceptually represents and can perform operations on them to obtain the derived data.
These are represented by
Engine instances held by the relation objects themselves, and the
iteration subpackages provide at least partial implementations of engines for relations backed by SQL databases (via SQLAlchemy) and native Python iterables, respectively.
A relation tree can include multiple engines, by using a
Transfer relation class to mark the boundaries between them.
Processor class can be used to execute multiple-engine trees.
It is up to an engine how strictly its operations adhere to relational algebra operation definition.
SQL is formally defined in terms of operations on “bags” or “multisets”, whose rows are not unique and sometimes ordered, while formal relations are always unordered and unique.
Relation interface has more a more permissive view of uniqueness to facilitate interaction with SQL: a
Relation may have non-unique rows, but any duplicates are not meaningful, and hence most operations may remove or propagate duplicates at their discretion, though engines may make stronger guarantees and most relations are not permitted to introduce duplication when applied to a base relation with unique rows.
It is also up to engines to determine whether an operations maintains the order of rows; SQL engine operations often do not, while the iteration engine’s operations always do.
MarkerRelation objects can have an engine-specific
payload attribute that either holds the actual relation state for that engine or a reference to it.
iteration engine’s payloads are instances of the
iteration.RowIterable interface, while the SQL engine’s payloads are
sql.Payload instances, which can represent a SQLAlchemy table or subquery expression.
LeafRelation objects always have a payload that is not
Materialization markers indicate that a payload should be attached when the relation is first “executed” by an engine or the
Processor class, allowing subsequent executions to reuse that payload and avoid repeating upstream execution.
Attaching a payload is the only way a relation can be modified after construction, and a payload that is not
None can never be replaced.
The full set of unary and binary operations provided is given below, along with the
Relation factory methods that can be used to apply certain operations directly.
Applying an operation to a relation always returns a new relation (unless the operation is a no-op, in which case the original may be returned unchanged), and always acts lazily: applying an operation is not the same as processing or executing a relation tree that contains that operation.
Add a new column whose values are calculated from one or more existing columns, via by a
Concatenate the rows of two relations that have the same columns. This is equivalent to
UNION ALLin SQL or
Perform a natural join: combine two relations by matching rows with the same values in their common columns (and satisfying an optional column expression, via a
Predicate), producing a new relation whose columns are the union of the columns of its operands. This is equivalent to [
Remove some columns from a relation.
An intermediate abstract base class for unary operations that only change the order of rows.
An intermediate abstract base class for unary operations that only remove rows.
Remove rows that satisfy a
boolean column expression. This is equivalent to the
WHEREclause in SQL.
Remove rows outside an integer range of indices. This is equivalent to
LIMITin SQL, or indexing with
start:stopsyntax in Python.
Sort rows according to a
Many relation operations involve column expressions, such as the boolean filter used in a
Selection or the sort keys used in a
These are represented by the
ColumnExpression (for general scalar-valued expressions),
Predicate (for boolean expressions), and
ColumnContainer (for expressions that hold multiple values) abstract base classes.
These base classes provide factory methods for all derived classes, making it generally unnecessary to refer to those types directly (except when writing an algorithm or engine that processes a relation tree; see Extensibility).
Column expression objects can in general support multiple engines; some types are required to be supported by all engines, while others can hold a list of engine types that support them.
The concrete column expression classes provided by
lsst.daf.relation are given below, with their factory methods:
A constant scalar, non-boolean Python value.
A reference to a scalar, non-boolean column in a relation.
A named function that returns a scalar, non-boolean value given scalar, non-boolean arguments. It is up to each
Enginehow and whether it supports a
ColumnFunction; this could include looking up the name in some module or treating it as a method that should be present on some object that represents a column value more directly in that engine.
A sequence of one or more
ColumnExpressionobjects, representing the same column type (but not neecessarily the same expression type.
A virtual sequence of literal integer values represented by a Python
A reference to a boolean-valued column in a relation.
A named function that returns a boolean, given scalar, non-boolean arguments. Like
ColumnFunction, implementation and support are engine-specific.
gemethods for the common case of
PredicateFunctionobjects that represent comparison operators.
Ideally, this library would be extensible in three different ways:
Relationor operation types could be defined, representing new kinds of nodes in a relation expression tree;
Enginetypes could be defined, representing different ways of storing and manipulating tabular data;
external algorithms on relation trees could be defined.
Unfortunately, any custom
Engine or relation-tree algorithm in practice needs to be able to enumerate all possible
Relation and operation types (or at least reject any trees with
Relation or operation types it does not recognize).
Similarly, any custom
Relation or operation type would need to be able to enumerate the algorithms and engines it supports, in order to provide its own implementations for those algorithms and engines.
This is a fragile multiple-dispatch problem.
To simplify things, this package chooses to prohibit most kinds of
Relation and operation extensibility:
BinaryOperationand column expression subclasses are not permitted;
These prohibitions are enforced by
__init_subclass__ checks in the abstract base classes.
Relation is actually a
typing.Protocol, not (just) an ABC, and the concrete
MarkerRelation classes actually inherit from
BaseRelation while satisfying the
Relation interface only in a structural subtyping sense.
This allows various
Relation attribute interfaces (e.g.
Relation.engine) to be implemented as either true properties or dataclass fields, and it should be invisible to users except in the rare case that they need to perform a runtime
isinstance check with the
Relation type itself, not just a specific concrete
Relation subclass: in this case
BaseRelation must be used instead of
The standard approach to designs like this in object oriented programming is the Visitor Pattern, which in this case would involve a base class or suite of base classes for algorithms or engines with a method for each possible relation-tree node type (relations, operations, column expressions); these would be invoked by a method on each node interface whose concrete implementations call the corresponding algorithm or engine method. This implicitly restricts the set of node tree types to those with methods in the algorithm/engine base class.
In languages with functional programming aspects, a much more direct approach involving enumerated types and pattern-matching syntax is often possible.
With the introduction of the
match statement in Python 3.10, this now includes Python, and this is the approach taken in here.
This results in much more readable and concise code than the boilerplate-heavy visitor-pattern alternative, but it comes with a significant drawback: there are no checks (either at runtime or via static type checkers like MyPy) that all necessary
case branches of a
match are present.
This is in part by design - many algorithms on relation trees can act generically on most operation types, and hence need to only explicitly match one or two - but it requires considerable discipline from algorithm and engine authors to ensure that match logic is correct and well-tested.
Another (smaller) drawback is that it can occasionally yield code that in other contexts might be considered antipatterns (e.g.
isinstance is often a good alternative to a single-branch
Python API reference¶
Flatten all logical AND operations in predicate into a
An implementation-focused target class for concrete
An abstract base class for operations that act on a pair of relations.
A relation operation that adds a new column from an expression involving existing columns.
A relation operation that concatenates the rows of a pair of relations with the same columns.
A abstract base class and factory for expressions that represent containers of multiple column values.
Exception type raised to indicate problems in a relation's columns and/or unique keys.
An abstract base class and factory for scalar, non-boolean column expressions.
A container expression backed by a sequence of scalar column expressions.
A concrete column expression that represents calling a named function with column expression arguments.
A boolean column expression that tests whether a scalar column expression is present in a container expression.
A concrete column expression backed by a regular Python value.
A container expression backed by a range of integer indices.
A concrete column expression that refers to a relation column.
An interface for objects that represent columns in a relation.
A relation operation that removes duplicate rows.
A relation-processing algorithm that attempts to explain why a relation has no rows.
An abstract interface for the systems that hold relation data and know how to process relation trees.
Exception type raised to indicate that a relation tree is incompatible with an engine, or has inconsistent engines.
An implementation-focused base class for
A concrete unary operation that does nothing.
A binary operation that passes through one of its operands and ignores the other.
A natural join operation.
A concrete boolean column expression that ANDs its operands.
A concrete boolean column expression that inverts its operand.
A concrete boolean column expression that ORs its operands.
An extensible relation base class that provides additional information about another relation without changing its row-and-column content.
A marker operation that indicates that the upstream tree should be evaluated only once, with the results saved and reused for subsequent processing.
An abstract base class and factory for boolean column expressions.
A concrete boolean expression that represents calling an named function with column expression arguments.
A concrete boolean column expression that refers to a boolean relation column.
An inheritable framework for processing multi-engine relation trees.
A unary operation that removes one or more columns.
An abstract interface for expression trees on tabular data.
Exception type raised to indicate problems in the definition of a relation tree.
A relation operation that filters rows according to a boolean column expression.
A relation relation that filters rows that are outside a range of positional indices.
A relation operation that orders rows according to a sequence of column expressions.
Sort expression and indication of sort direction.
A struct for the return value of
An abstract base class for operations that act on a single relation.