MultilevelParquetTable
- class lsst.pipe.tasks.parquetTable.MultilevelParquetTable(*args, **kwargs)
Bases: ParquetTable
Wrapper to access a dataframe with a multi-level column index from Parquet.

This subclass of ParquetTable is necessary because there is not a convenient way to request specific table subsets by level from Parquet through pyarrow, as there is with a pandas.DataFrame.

Additionally, pyarrow stores multilevel index information in a very strange way. Pandas stores it as a tuple, so that one can access a single column from a pandas dataframe as df[('ref', 'HSC-G', 'coord_ra')]. However, for some reason pyarrow saves these indices as "stringified" tuples, such that in order to read this same column from a table written to Parquet, you would have to do the following:

pf = pyarrow.ParquetFile(filename)
df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised this issue.
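For concreteness, here is a minimal, self-contained sketch of that behavior (the file name and data are made up, and the exact stringified form may depend on the pyarrow version):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small dataframe with a multi-level column index.
columns = pd.MultiIndex.from_tuples(
    [('ref', 'HSC-G', 'coord_ra'), ('ref', 'HSC-G', 'coord_dec')],
    names=['dataset', 'filter', 'column'],
)
df = pd.DataFrame([[1.0, 2.0]], columns=columns)

# pyarrow flattens the multi-index into stringified tuples on write.
pq.write_table(pa.Table.from_pandas(df), 'example.parq')

# Reading a single column back requires the stringified-tuple name,
# not the tuple itself.
pf = pq.ParquetFile('example.parq')
sub = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"]).to_pandas()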
As multilevel-indexed dataframes can be very useful for storing things like multiple filters' worth of data in the same table, this case deserves a wrapper to enable easier access; that is what this object is for. For example,

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G',
              'column': ['coord_ra', 'coord_dec']}
df = parq.toDataFrame(columns=columnDict)
will return just the coordinate columns; the equivalent of calling df['meas']['HSC-G'][['coord_ra', 'coord_dec']] on the total dataframe, but without having to load the whole frame into memory, since this reads just those columns from disk. You can also request a sub-table; e.g.,

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G'}
df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of df['meas']['HSC-G'] on the total dataframe.
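Putting the pieces above together, a minimal end-to-end sketch might look like the following (the import path follows the class path at the top of this page; the file name and data are made up):

import pandas as pd
from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

# A small dataframe with a three-level column index.
columns = pd.MultiIndex.from_tuples(
    [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec')],
    names=['dataset', 'filter', 'column'],
)
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=columns)

# Wrap the in-memory dataframe, persist it, and read a subset back.
MultilevelParquetTable(dataFrame=df).write('multilevel.parq')
parq = MultilevelParquetTable('multilevel.parq')
sub = parq.toDataFrame(columns={'dataset': 'meas',
                                'filter': 'HSC-G',
                                'column': ['coord_ra']})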
- Parameters:
- filename : str, optional
Path to Parquet file.
- dataFrame : pandas.DataFrame, optional
DataFrame to wrap.
Attributes Summary
- columnIndex: Columns as a pandas Index
- columnLevelNames
- columnLevels: Names of levels in column index
- columns: List of column names (or column index if df is set)
- pandasMd
Methods Summary
- toDataFrame([columns, droplevels]): Get table (or specified columns) as a pandas DataFrame
- write(filename): Write pandas dataframe to parquet
Attributes Documentation
- columnIndex
Columns as a pandas Index
- columnLevelNames
- columnLevels
Names of levels in column index
- columns
List of column names (or column index if df is set)
This may either be a list of column names, or a pandas.Index object describing the column index, depending on whether the ParquetTable object is wrapping a ParquetFile or a DataFrame.
- pandasMd
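As an illustration, one might inspect the column structure before requesting data; a short sketch (the file name is hypothetical, and the printed values depend on the table's actual index):

parq = MultilevelParquetTable('multilevel.parq')

print(parq.columnLevels)  # names of the index levels,
                          # e.g. ['dataset', 'filter', 'column']
print(parq.columnIndex)   # the columns as a pandas Index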
Methods Documentation
- toDataFrame(columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame.
To get specific columns in specified sub-levels:

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G',
              'column': ['coord_ra', 'coord_dec']}
df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G'}
df = parq.toDataFrame(columns=columnDict)
- Parameters:
- columns : list or dict, optional
Desired columns. If None, then all columns will be returned. If a list, then the names of the columns must be exactly as stored by pyarrow; that is, stringified tuples. If a dictionary, then the entries of the dictionary must correspond to the level names of the column multi-index (that is, the columnLevels attribute). Not every level must be passed; if any level is left out, then all entries in that level will be implicitly included.
- droplevels : bool
If True, drop levels of the column index that have just one entry.
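To illustrate the droplevels behavior, a short sketch continuing the hypothetical parq from the examples above:

# droplevels=True (the default): the 'dataset' and 'filter' levels each
# have a single entry here, so they are dropped from the result.
df = parq.toDataFrame(columns={'dataset': 'meas',
                               'filter': 'HSC-G',
                               'column': ['coord_ra', 'coord_dec']})
ra = df['coord_ra']  # single-level column access

# droplevels=False: the full multi-index is preserved, so columns are
# accessed with tuples, as in the underlying pandas dataframe.
df = parq.toDataFrame(columns={'dataset': 'meas',
                               'filter': 'HSC-G',
                               'column': ['coord_ra', 'coord_dec']},
                      droplevels=False)
ra = df[('meas', 'HSC-G', 'coord_ra')]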
- write(filename)
Write pandas dataframe to parquet.
- Parameters:
- filename : str
Path to which to write.
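For example, to persist a wrapped in-memory dataframe (the output path is hypothetical, and dataFrame is per the Parameters section above):

parq = MultilevelParquetTable(dataFrame=df)
parq.write('/path/to/multilevel.parq')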