lsst.daf.relation¶
Overview¶
The Relation
class represents the core concept of relational algebra: a table with a well-defined set of columns and unique rows.
A Relation
instance does not necessarily correspond to a concrete in-memory or on-disk table, however; most derived Relation types actually represent an operation on some other “target” relation or relations, forming an expression tree.
Operations on relations are represented by subclasses of UnaryOperation
and BinaryOperation
, which are associated with a target relation to form a new relation via the UnaryOperationRelation
and BinaryOperationRelation
classes.
The LeafRelation
class handles relations that represent direct storage of rows (and in some cases actually do store rows themselves).
The Relation
interface provides factory methods for constructing and applying operations to relation that should be used instead of directly interacting with operation classes when possible.
The fourth and final Relation
class is MarkerRelation
, which adds some information or context to a target relation without changing the relation’s
actual content.
Unlike other relation types, MarkerRelation
can be inherited from freely.
Concrete Relation
classes (including extensions) should be frozen dataclasses, ensuring that they are immutable (with the exception of Relation.payload
), equality comparable, hashable, and that they have a useful (but in still lossy) repr
.
Relation
classes also provide a str
representation that is much more concise, for cases where seeing the overall tree is more important than the details of any particular relation.
Engines¶
Relations are associated with “engines”: systems that may hold the actual data a relation (at least) conceptually represents and can perform operations on them to obtain the derived data.
These are represented by Engine
instances held by the relation objects themselves, and the sql
and iteration
subpackages provide at least partial implementations of engines for relations backed by SQL databases (via SQLAlchemy) and native Python iterables, respectively.
A relation tree can include multiple engines, by using a Transfer
relation class to mark the boundaries between them.
The Processor
class can be used to execute multiple-engine trees.
It is up to an engine how strictly its operations adhere to relational algebra operation definition.
SQL is formally defined in terms of operations on “bags” or “multisets”, whose rows are not unique and sometimes ordered, while formal relations are always unordered and unique.
The Relation
interface has more a more permissive view of uniqueness to facilitate interaction with SQL: a Relation
may have non-unique rows, but any duplicates are not meaningful, and hence most operations may remove or propagate duplicates at their discretion, though engines may make stronger guarantees and most relations are not permitted to introduce duplication when applied to a base relation with unique rows.
It is also up to engines to determine whether an operations maintains the order of rows; SQL engine operations often do not, while the iteration engine’s operations always do.
LeafRelation
and MarkerRelation
objects can have an engine-specific payload
attribute that either holds the actual relation state for that engine or a reference to it.
The iteration
engine’s payloads are instances of the iteration.RowIterable
interface, while the SQL engine’s payloads are sql.Payload
instances, which can represent a SQLAlchemy table or subquery expression.
LeafRelation
objects always have a payload that is not None
.
Materialization
markers indicate that a payload should be attached when the relation is first “executed” by an engine or the Processor
class, allowing subsequent executions to reuse that payload and avoid repeating upstream execution.
Attaching a payload is the only way a relation can be modified after construction, and a payload that is not None
can never be replaced.
Operations¶
The full set of unary and binary operations provided is given below, along with the Relation
factory methods that can be used to apply certain operations directly.
Applying an operation to a relation always returns a new relation (unless the operation is a no-op, in which case the original may be returned unchanged), and always acts lazily: applying an operation is not the same as processing or executing a relation tree that contains that operation.
Calculation
(UnaryOperation
) /Relation.with_calculated_column
Add a new column whose values are calculated from one or more existing columns, via by a
column expression
.Chain
(BinaryOperation
) /Relation.chain
Concatenate the rows of two relations that have the same columns. This is equivalent to
UNION ALL
in SQL oritertools.chain
in Python.Deduplication
(UnaryOperation
) /Relation.without_duplicates
Remove duplicate rows. This is equivalent to
SELECT DISTINCT
in SQL or filtering throughset
ordict
in Python.Identity
(Marker
)Do nothing. This operation never actually appears in
Relation
trees;Identity.apply
always returns the operand relation passed to it.Join
(BinaryOperation
) /Relation.join
Perform a natural join: combine two relations by matching rows with the same values in their common columns (and satisfying an optional column expression, via a
Predicate
), producing a new relation whose columns are the union of the columns of its operands. This is equivalent to [INNER
]JOIN
in SQL.Projection
(UnaryOperation
) /Relation.with_only_columns
Remove some columns from a relation.
Reordering
(UnaryOperation
)An intermediate abstract base class for unary operations that only change the order of rows.
RowFilter
(UnaryOperation
)An intermediate abstract base class for unary operations that only remove rows.
Selection
(RowFilter
) /Relation.with_rows_satisfying
Remove rows that satisfy a
boolean column expression
. This is equivalent to theWHERE
clause in SQL.Slice
(RowFilter
) /Relation.__getitem__
Remove rows outside an integer range of indices. This is equivalent to
OFFSET
andLIMIT
in SQL, or indexing withslice
object orstart:stop
syntax in Python.Sort
(Reordering
) /Relation.sorted
Sort rows according to a
column expression
.
Column Expressions¶
Many relation operations involve column expressions, such as the boolean filter used in a Selection
or the sort keys used in a Sort
.
These are represented by the ColumnExpression
(for general scalar-valued expressions), Predicate
(for boolean expressions), and ColumnContainer
(for expressions that hold multiple values) abstract base classes.
These base classes provide factory methods for all derived classes, making it generally unnecessary to refer to those types directly (except when writing an algorithm or engine that processes a relation tree; see Extensibility).
Column expression objects can in general support multiple engines; some types are required to be supported by all engines, while others can hold a list of engine types that support them.
The concrete column expression classes provided by lsst.daf.relation
are given below, with their factory methods:
ColumnLiteral
/ColumnExpression.literal
A constant scalar, non-boolean Python value.
ColumnReference
/ColumnExpression.reference
A reference to a scalar, non-boolean column in a relation.
ColumnFunction
` /ColumnExpression.function
,ColumnExpression.method
A named function that returns a scalar, non-boolean value given scalar, non-boolean arguments. It is up to each
Engine
how and whether it supports aColumnFunction
; this could include looking up the name in some module or treating it as a method that should be present on some object that represents a column value more directly in that engine.ColumnExpressionSequence
/ColumnContainer.sequence
A sequence of one or more
ColumnExpression
objects, representing the same column type (but not neecessarily the same expression type.ColumnRangeLiteral
/ColumnContainer.range_literal
A virtual sequence of literal integer values represented by a Python
range
object.PredicateLiteral
/Predicate.literal
PredicateReference
/Predicate.reference
A reference to a boolean-valued column in a relation.
PredicateFunction
/ColumnExpression.predicate_function
,ColumnExpression.predicate_method
A named function that returns a boolean, given scalar, non-boolean arguments. Like
ColumnFunction
, implementation and support are engine-specific.ColumnExpression
also haseq
,ne
,lt
,le
,gt
, andge
methods for the common case ofPredicateFunction
objects that represent comparison operators.LogicalNot
/Predicate.logical_not
Boolean expression that is
True
if its (single, boolean) argument isFalse
, and vice versa.LogicalAnd
/Predicate.logical_and
Boolean expression that is
True
only if all (boolean) arguments areTrue
.LogicalOr
/Predicate.logical_or
Boolean expression that is
True
if any (boolean) argument isTrue
.ColumnInContainer
/ColumnContainer.contains
Boolean expression that tests whether a scalar
ColumnExpression
is included in aColumnContainer
.
Extensibility¶
Ideally, this library would be extensible in three different ways:
external
Relation
or operation types could be defined, representing new kinds of nodes in a relation expression tree;external
Engine
types could be defined, representing different ways of storing and manipulating tabular data;external algorithms on relation trees could be defined.
Unfortunately, any custom Engine
or relation-tree algorithm in practice needs to be able to enumerate all possible Relation
and operation types (or at least reject any trees with Relation
or operation types it does not recognize).
Similarly, any custom Relation
or operation type would need to be able to enumerate the algorithms and engines it supports, in order to provide its own implementations for those algorithms and engines.
This is a fragile multiple-dispatch problem.
To simplify things, this package chooses to prohibit most kinds of Relation
and operation extensibility:
custom
Relation
subclasses must beMarkerRelation
subclasses;custom
BinaryOperation
and column expression subclasses are not permitted;custom subclasses of
UnaryOperation
are restricted to subclasses of the more limitedRowFilter
andReordering
intermediate interfaces.
These prohibitions are enforced by __init_subclass__
checks in the abstract base classes.
Note
Relation
is actually a typing.Protocol
, not (just) an ABC, and the concrete LeafRelation
, UnaryOperationRelation
, BinaryOperationRelation
, and MarkerRelation
classes actually inherit from BaseRelation
while satisfying the Relation
interface only in a structural subtyping sense.
This allows various Relation
attribute interfaces (e.g. Relation.engine
) to be implemented as either true properties or dataclass fields, and it should be invisible to users except in the rare case that they need to perform a runtime isinstance
check with the Relation
type itself, not just a specific concrete Relation
subclass: in this case BaseRelation
must be used instead of Relation
.
The standard approach to designs like this in object oriented programming is the Visitor Pattern, which in this case would involve a base class or suite of base classes for algorithms or engines with a method for each possible relation-tree node type (relations, operations, column expressions); these would be invoked by a method on each node interface whose concrete implementations call the corresponding algorithm or engine method. This implicitly restricts the set of node tree types to those with methods in the algorithm/engine base class.
In languages with functional programming aspects, a much more direct approach involving enumerated types and pattern-matching syntax is often possible.
With the introduction of the match
statement in Python 3.10, this now includes Python, and this is the approach taken in here.
This results in much more readable and concise code than the boilerplate-heavy visitor-pattern alternative, but it comes with a significant drawback: there are no checks (either at runtime or via static type checkers like MyPy) that all necessary case
branches of a match
are present.
This is in part by design - many algorithms on relation trees can act generically on most operation types, and hence need to only explicitly match one or two - but it requires considerable discipline from algorithm and engine authors to ensure that match logic is correct and well-tested.
Another (smaller) drawback is that it can occasionally yield code that in other contexts might be considered antipatterns (e.g. isinstance
is often a good alternative to a single-branch match
).
Provided Engines¶
Contributing¶
lsst.daf.relation
is developed at https://github.com/lsst-dm/daf_relation.
You can find Jira issues for this module under the daf_relation component.
Python API reference¶
lsst.daf.relation Package¶
Functions¶
|
Flatten all logical AND operations in predicate into a |
Classes¶
An implementation-focused target class for concrete |
|
An abstract base class for operations that act on a pair of relations. |
|
|
A concrete |
|
A relation operation that adds a new column from an expression involving existing columns. |
|
A relation operation that concatenates the rows of a pair of relations with the same columns. |
A abstract base class and factory for expressions that represent containers of multiple column values. |
|
Exception type raised to indicate problems in a relation's columns and/or unique keys. |
|
An abstract base class and factory for scalar, non-boolean column expressions. |
|
|
A container expression backed by a sequence of scalar column expressions. |
|
A concrete column expression that represents calling a named function with column expression arguments. |
|
A boolean column expression that tests whether a scalar column expression is present in a container expression. |
|
A concrete column expression backed by a regular Python value. |
|
A container expression backed by a range of integer indices. |
|
A concrete column expression that refers to a relation column. |
|
An interface for objects that represent columns in a relation. |
A relation operation that removes duplicate rows. |
|
|
A relation-processing algorithm that attempts to explain why a relation has no rows. |
|
An abstract interface for the systems that hold relation data and know how to process relation trees. |
Exception type raised to indicate that a relation tree is incompatible with an engine, or has inconsistent engines. |
|
|
An implementation-focused base class for |
|
A concrete unary operation that does nothing. |
|
A binary operation that passes through one of its operands and ignores the other. |
|
A natural join operation. |
|
A |
|
A concrete boolean column expression that ANDs its operands. |
|
A concrete boolean column expression that inverts its operand. |
|
A concrete boolean column expression that ORs its operands. |
|
An extensible relation base class that provides additional information about another relation without changing its row-and-column content. |
|
A marker operation that indicates that the upstream tree should be evaluated only once, with the results saved and reused for subsequent processing. |
|
A |
An abstract base class and factory for boolean column expressions. |
|
|
A concrete boolean expression that represents calling an named function with column expression arguments. |
|
A concrete boolean column expression that is a constant |
|
A concrete boolean column expression that refers to a boolean relation column. |
An inheritable framework for processing multi-engine relation trees. |
|
|
A unary operation that removes one or more columns. |
|
An abstract interface for expression trees on tabular data. |
Exception type raised to indicate problems in the definition of a relation tree. |
|
An extensible |
|
An extensible |
|
|
A relation operation that filters rows according to a boolean column expression. |
|
A relation relation that filters rows that are outside a range of positional indices. |
|
A relation operation that orders rows according to a sequence of column expressions. |
|
|
|
A |
|
A struct for the return value of |
An abstract base class for operations that act on a single relation. |
|
|
A concrete |