MultilevelParquetTable¶
-
class
lsst.pipe.tasks.parquetTable.
MultilevelParquetTable
(*args, **kwargs)¶ Bases:
lsst.pipe.tasks.parquetTable.ParquetTable
Wrapper to access dataframe with multi-level column index from Parquet
This subclass of
ParquetTable
to handle the multi-level is necessary because there is not a convenient way to request specific table subsets by level via Parquet through pyarrow, as there is with apandas.DataFrame
.Additionally, pyarrow stores multilevel index information in a very strange way. Pandas stores it as a tuple, so that one can access a single column from a pandas dataframe as
df[('ref', 'HSC-G', 'coord_ra')]
. However, for some reason pyarrow saves these indices as “stringified” tuples, such that in order to read thissame column from a table written to Parquet, you would have to do the following:pf = pyarrow.ParquetFile(filename) df = pf.read(columns=[“(‘ref’, ‘HSC-G’, ‘coord_ra’)”])See also https://github.com/apache/arrow/issues/1771, where we’ve raised this issue.
As multilevel-indexed dataframes can be very useful to store data like multiple filters’ worth of data in the same table, this case deserves a wrapper to enable easier access; that’s what this object is for. For example,
parq = MultilevelParquetTable(filename) columnDict = {‘dataset’:’meas’,
‘filter’:’HSC-G’, ‘column’:[‘coord_ra’, ‘coord_dec’]}df = parq.toDataFrame(columns=columnDict)
will return just the coordinate columns; the equivalent of calling
df['meas']['HSC-G'][['coord_ra', 'coord_dec']]
on the total dataframe, but without having to load the whole frame into memory—this reads just those columns from disk. You can also request a sub-table; e.g.,parq = MultilevelParquetTable(filename) columnDict = {‘dataset’:’meas’,
‘filter’:’HSC-G’}df = parq.toDataFrame(columns=columnDict)
and this will be the equivalent of
df['meas']['HSC-G']
on the total dataframe.Parameters: - filename : str, optional
Path to Parquet file.
- dataFrame : dataFrame, optional
Attributes Summary
columnIndex
Columns as a pandas Index columnLevelNames
columnLevels
Names of levels in column index columns
List of column names (or column index if df is set) pandasMd
Methods Summary
toDataFrame
([columns, droplevels])Get table (or specified columns) as a pandas DataFrame write
(filename)Write pandas dataframe to parquet Attributes Documentation
-
columnIndex
¶ Columns as a pandas Index
-
columnLevelNames
¶
-
columnLevels
¶ Names of levels in column index
-
columns
¶ List of column names (or column index if df is set)
This may either be a list of column names, or a pandas.Index object describing the column index, depending on whether the ParquetTable object is wrapping a ParquetFile or a DataFrame.
-
pandasMd
¶
Methods Documentation
-
toDataFrame
(columns=None, droplevels=True)¶ Get table (or specified columns) as a pandas DataFrame
To get specific columns in specified sub-levels:
parq = MultilevelParquetTable(filename) columnDict = {‘dataset’:’meas’,
‘filter’:’HSC-G’, ‘column’:[‘coord_ra’, ‘coord_dec’]}df = parq.toDataFrame(columns=columnDict)
Or, to get an entire subtable, leave out one level name:
parq = MultilevelParquetTable(filename) columnDict = {‘dataset’:’meas’,
‘filter’:’HSC-G’}df = parq.toDataFrame(columns=columnDict)
Parameters: - columns : list or dict, optional
Desired columns. If
None
, then all columns will be returned. If a list, then the names of the columns must be exactly as stored by pyarrow; that is, stringified tuples. If a dictionary, then the entries of the dictionary must correspond to the level names of the column multi-index (that is, thecolumnLevels
attribute). Not every level must be passed; if any level is left out, then all entries in that level will be implicitly included.- droplevels : bool
If True drop levels of column index that have just one entry
-
write
(filename)¶ Write pandas dataframe to parquet
Parameters: - filename : str
Path to which to write.