MultilevelParquetTable

class lsst.pipe.tasks.parquetTable.MultilevelParquetTable(*args, **kwargs)

Bases: ParquetTable

Wrapper to access dataframe with multi-level column index from Parquet

This subclass of ParquetTable is necessary to handle multi-level column indices, because there is no convenient way to request specific table subsets by level via Parquet through pyarrow, as there is with a pandas.DataFrame.

Additionally, pyarrow stores multilevel index information in a very strange way. Pandas stores it as a tuple, so that one can access a single column from a pandas dataframe as df[('ref', 'HSC-G', 'coord_ra')]. However, for some reason pyarrow saves these indices as "stringified" tuples, such that in order to read this same column from a table written to Parquet, you would have to do the following:

    import pyarrow.parquet

    pf = pyarrow.parquet.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"]).to_pandas()

See also https://github.com/apache/arrow/issues/1771, where we’ve raised this issue.
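For concreteness, here is a minimal sketch of how a multi-level pandas column index gets stringified on disk. The toy column names and file name are hypothetical, and the exact behavior may vary with pyarrow version:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy frame with a two-level column index ('dataset', 'column').
    cols = pd.MultiIndex.from_tuples(
        [('meas', 'coord_ra'), ('meas', 'coord_dec')],
        names=['dataset', 'column'])
    df = pd.DataFrame([[1.0, 2.0]], columns=cols)

    # Round-trip through pyarrow: the tuple column names are stringified.
    pq.write_table(pa.Table.from_pandas(df), 'toy.parq')
    print(pq.ParquetFile('toy.parq').schema_arrow.names)
    # e.g. ["('meas', 'coord_ra')", "('meas', 'coord_dec')"]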

As multilevel-indexed dataframes are very useful for storing data such as multiple filters' worth of measurements in the same table, this case deserves a wrapper to enable easier access; that is what this object is for. For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns: the equivalent of calling df['meas']['HSC-G'][['coord_ra', 'coord_dec']] on the total dataframe, but without having to load the whole frame into memory, since only the requested columns are read from disk. You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of df['meas']['HSC-G'] on the total dataframe.
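Note that because toDataFrame drops single-valued levels by default (see the droplevels parameter below), the returned sub-table can be indexed by plain column names. A minimal sketch, assuming the example file above:

    df = parq.toDataFrame(columns={'dataset': 'meas', 'filter': 'HSC-G'})

    # With droplevels=True (the default), the single-valued 'dataset' and
    # 'filter' levels are dropped, so columns are addressed directly:
    ra = df['coord_ra']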

Parameters:
filename : str, optional

Path to Parquet file.

dataFrame : dataFrame, optional

Deprecated since version v25: The MultilevelParquetTable interface is from Gen2 i/o and will be removed after v26.

Attributes Summary

columnIndex

Columns as a pandas Index

columnLevelNames

columnLevels

Names of levels in column index

columns

List of column names (or column index if df is set)

pandasMd

Methods Summary

toDataFrame([columns, droplevels])

Get table (or specified columns) as a pandas DataFrame

write(filename)

Write pandas dataframe to parquet

Attributes Documentation

columnIndex

Columns as a pandas Index

columnLevelNames

columnLevels

Names of levels in column index

columns

List of column names (or column index if df is set)

This may either be a list of column names, or a pandas.Index object describing the column index, depending on whether the ParquetTable object is wrapping a ParquetFile or a DataFrame.
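A minimal sketch of the two cases, assuming the filename/dataFrame keywords from the Parameters section above:

    # Backed by a file on disk: a list of the on-disk (stringified) names.
    ParquetTable(filename='table.parq').columns   # list of str

    # Backed by an in-memory frame: the DataFrame's own column index.
    ParquetTable(dataFrame=df).columns            # pandas.Index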

pandasMd

Methods Documentation

toDataFrame(columns=None, droplevels=True)

Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters:
columns : list or dict, optional

Desired columns. If None, then all columns will be returned. If a list, then the names of the columns must be exactly as stored by pyarrow; that is, stringified tuples. If a dictionary, then the entries of the dictionary must correspond to the level names of the column multi-index (that is, the columnLevels attribute). Not every level must be passed; if any level is left out, then all entries in that level will be implicitly included. Both forms are illustrated in the sketch after this parameter list.

droplevels : bool

If True, drop levels of the column index that have just one entry.
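To illustrate the two accepted forms of the columns argument, the following requests should be equivalent (a sketch, assuming the example table used throughout this page):

    # Dict form, keyed by column-index level name ...
    df = parq.toDataFrame(columns={'dataset': 'meas',
                                   'filter': 'HSC-G',
                                   'column': ['coord_ra']})

    # ... or list form, using the stringified tuples stored by pyarrow.
    df = parq.toDataFrame(columns=["('meas', 'HSC-G', 'coord_ra')"])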

write(filename)

Write pandas dataframe to parquet

Parameters:
filename : str

Path to which to write.
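A minimal usage sketch; the frame and output path are hypothetical:

    # Wrap an in-memory multi-level dataframe and persist it to Parquet.
    parq = MultilevelParquetTable(dataFrame=df)
    parq.write('multilevel_table.parq')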