MultilevelParquetTable
- class lsst.pipe.tasks.parquetTable.MultilevelParquetTable(*args, **kwargs)
Bases: ParquetTable
Wrapper to access a dataframe with a multi-level column index from Parquet.

This subclass of ParquetTable is necessary because there is not a convenient way to request specific table subsets by level from Parquet through pyarrow, as there is with a pandas.DataFrame.

Additionally, pyarrow stores multilevel index information in a very strange way. Pandas stores it as a tuple, so that one can access a single column from a pandas dataframe as df[('ref', 'HSC-G', 'coord_ra')]. However, for some reason pyarrow saves these indices as "stringified" tuples, such that in order to read this same column from a table written to Parquet, you would have to do the following:

pf = pyarrow.ParquetFile(filename)
df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised this issue.
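For concreteness, here is a minimal, self-contained sketch of that behavior (the file name and data are made up, and the exact stringified form may depend on the pyarrow version):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small dataframe with a multi-level column index.
columns = pd.MultiIndex.from_tuples(
    [('ref', 'HSC-G', 'coord_ra'), ('ref', 'HSC-G', 'coord_dec')],
    names=['dataset', 'filter', 'column'],
)
df = pd.DataFrame([[1.0, 2.0]], columns=columns)

# pyarrow flattens the multi-index into stringified tuples on write.
pq.write_table(pa.Table.from_pandas(df), 'example.parq')

# Reading a single column back requires the stringified-tuple name,
# not the tuple itself.
pf = pq.ParquetFile('example.parq')
sub = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"]).to_pandas()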
As multilevel-indexed dataframes can be very useful for storing things like multiple filters' worth of data in the same table, this case deserves a wrapper to enable easier access; that is what this object is for. For example,

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G',
              'column': ['coord_ra', 'coord_dec']}
df = parq.toDataFrame(columns=columnDict)
will return just the coordinate columns; the equivalent of calling df['meas']['HSC-G'][['coord_ra', 'coord_dec']] on the total dataframe, but without having to load the whole frame into memory, since this reads just those columns from disk. You can also request a sub-table; e.g.,

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G'}
df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of df['meas']['HSC-G'] on the total dataframe.
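Putting the pieces above together, a minimal end-to-end sketch might look like the following (the import path follows the class path at the top of this page; the file name and data are made up):

import pandas as pd
from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

# A small dataframe with a three-level column index.
columns = pd.MultiIndex.from_tuples(
    [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec')],
    names=['dataset', 'filter', 'column'],
)
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=columns)

# Wrap the in-memory dataframe, persist it, and read a subset back.
MultilevelParquetTable(dataFrame=df).write('multilevel.parq')
parq = MultilevelParquetTable('multilevel.parq')
sub = parq.toDataFrame(columns={'dataset': 'meas',
                                'filter': 'HSC-G',
                                'column': ['coord_ra']})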
- Parameters:
- filename : str, optional
Path to Parquet file.
- dataFrame : pandas.DataFrame, optional
DataFrame to wrap.
Attributes Summary
- columnIndex: Columns as a pandas Index
- columnLevelNames
- columnLevels: Names of levels in column index
- columns: List of column names (or column index if df is set)
- pandasMd
Methods Summary
- toDataFrame([columns, droplevels]): Get table (or specified columns) as a pandas DataFrame
- write(filename): Write pandas dataframe to parquet
Attributes Documentation
- columnIndex
Columns as a pandas Index
- columnLevelNames
- columnLevels
Names of levels in column index
- columns
List of column names (or column index if df is set)
This may either be a list of column names, or a pandas.Index object describing the column index, depending on whether the ParquetTable object is wrapping a ParquetFile or a DataFrame.
- pandasMd
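As an illustration, one might inspect the column structure before requesting data; a short sketch (the file name is hypothetical, and the printed values depend on the table's actual index):

parq = MultilevelParquetTable('multilevel.parq')

print(parq.columnLevels)  # names of the index levels,
                          # e.g. ['dataset', 'filter', 'column']
print(parq.columnIndex)   # the columns as a pandas Index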
Methods Documentation
- toDataFrame(columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame.
To get specific columns in specified sub-levels:

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G',
              'column': ['coord_ra', 'coord_dec']}
df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

parq = MultilevelParquetTable(filename)
columnDict = {'dataset': 'meas',
              'filter': 'HSC-G'}
df = parq.toDataFrame(columns=columnDict)
- Parameters:
- columns : list or dict, optional
Desired columns. If None, then all columns will be returned. If a list, then the names of the columns must be exactly as stored by pyarrow; that is, stringified tuples. If a dictionary, then the entries of the dictionary must correspond to the level names of the column multi-index (that is, the columnLevels attribute). Not every level must be passed; if any level is left out, then all entries in that level will be implicitly included.
- droplevels : bool
If True, drop levels of the column index that have just one entry.
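To illustrate the droplevels behavior, a short sketch continuing the hypothetical parq from the examples above:

# droplevels=True (the default): the 'dataset' and 'filter' levels each
# have a single entry here, so they are dropped from the result.
df = parq.toDataFrame(columns={'dataset': 'meas',
                               'filter': 'HSC-G',
                               'column': ['coord_ra', 'coord_dec']})
ra = df['coord_ra']  # single-level column access

# droplevels=False: the full multi-index is preserved, so columns are
# accessed with tuples, as in the underlying pandas dataframe.
df = parq.toDataFrame(columns={'dataset': 'meas',
                               'filter': 'HSC-G',
                               'column': ['coord_ra', 'coord_dec']},
                      droplevels=False)
ra = df[('meas', 'HSC-G', 'coord_ra')]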
- write(filename)
Write pandas dataframe to parquet.
- Parameters:
- filename : str
Path to which to write.
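For example, to persist a wrapped in-memory dataframe (the output path is hypothetical, and dataFrame is per the Parameters section above):

parq = MultilevelParquetTable(dataFrame=df)
parq.write('/path/to/multilevel.parq')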