MultilevelParquetTable¶
- class lsst.pipe.tasks.parquetTable.MultilevelParquetTable(*args, **kwargs)¶
- Bases: ParquetTable

  Wrapper to access a dataframe with a multi-level column index from Parquet.

  This subclass of ParquetTable is necessary to handle the multi-level case because there is no convenient way to request specific table subsets by level via Parquet through pyarrow, as there is with a pandas.DataFrame.

  Additionally, pyarrow stores multilevel index information in a very strange way. Pandas stores it as a tuple, so that one can access a single column from a pandas dataframe as df[('ref', 'HSC-G', 'coord_ra')]. However, for some reason pyarrow saves these indices as "stringified" tuples, such that in order to read this same column from a table written to Parquet, you would have to do the following (a short sketch demonstrating this behavior follows the parameter list below):

    pf = pyarrow.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

  See also https://github.com/apache/arrow/issues/1771, where we've raised this issue.

  As multilevel-indexed dataframes can be very useful to store data like multiple filters' worth of data in the same table, this case deserves a wrapper to enable easier access; that's what this object is for. For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

  will return just the coordinate columns; the equivalent of calling df['meas']['HSC-G'][['coord_ra', 'coord_dec']] on the total dataframe, but without having to load the whole frame into memory: this reads just those columns from disk. You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

  and this will be the equivalent of df['meas']['HSC-G'] on the total dataframe.

  Parameters:
- filename : str, optional
  Path to Parquet file.
- dataFrame : pandas.DataFrame, optional
  DataFrame to be wrapped.
- .. deprecated:: v25
  The MultilevelParquetTable interface is from Gen2 I/O and will be removed after v26.
 
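To make the stringified-tuple storage described above concrete, here is a minimal sketch of writing a multilevel dataframe through pyarrow (the file name is a placeholder, and the exact flattened names, including any index columns, can vary with the pyarrow version in use):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small dataframe with a three-level column index.
    columns = pd.MultiIndex.from_tuples(
        [("ref", "HSC-G", "coord_ra"), ("ref", "HSC-G", "coord_dec")],
        names=["dataset", "filter", "column"],
    )
    df = pd.DataFrame([[1.0, 2.0]], columns=columns)

    # Convert via pyarrow and write; the tuple columns are flattened
    # to "stringified" names in the Parquet schema.
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "example.parq")

    # e.g. "('ref', 'HSC-G', 'coord_ra')", "('ref', 'HSC-G', 'coord_dec')", ...
    print(pq.ParquetFile("example.parq").schema.names)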
Attributes Summary

- columnIndex: Columns as a pandas Index
- columnLevelNames
- columnLevels: Names of levels in column index
- columns: List of column names (or column index if df is set)
- pandasMd

Methods Summary

- toDataFrame([columns, droplevels]): Get table (or specified columns) as a pandas DataFrame
- write(filename): Write pandas dataframe to parquet

Attributes Documentation

- columnIndex¶
- Columns as a pandas Index 
 - columnLevelNames¶
 - columnLevels¶
- Names of levels in column index 
 - columns¶
- List of column names (or column index if df is set)

  This may either be a list of column names, or a pandas.Index object describing the column index, depending on whether the ParquetTable object is wrapping a ParquetFile or a DataFrame.
 - pandasMd¶
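As a quick illustration of these attributes, a minimal sketch (the file path is a placeholder, and the level names shown are those used in the examples above):

    from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

    parq = MultilevelParquetTable("multilevel.parq")  # placeholder path

    # Names of the levels in the column multi-index,
    # e.g. ['dataset', 'filter', 'column'].
    print(parq.columnLevels)

    # The columns as a pandas Index.
    print(parq.columnIndex)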
Methods Documentation

- toDataFrame(columns=None, droplevels=True)¶

  Get table (or specified columns) as a pandas DataFrame.

  To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

  Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

  Parameters:
- columns : list or dict, optional
  Desired columns. If None, then all columns will be returned. If a list, then the names of the columns must be exactly as stored by pyarrow; that is, stringified tuples. If a dictionary, then the entries of the dictionary must correspond to the level names of the column multi-index (that is, the columnLevels attribute). Not every level must be passed; if any level is left out, then all entries in that level will be implicitly included.
- droplevels : bool
  If True, drop levels of the column index that have just one entry (see the sketch below).
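As a minimal sketch of the droplevels behavior, assuming a table with levels ('dataset', 'filter', 'column') as in the examples above (the file path is a placeholder):

    parq = MultilevelParquetTable("multilevel.parq")  # placeholder path
    columnDict = {'dataset': 'meas', 'filter': 'HSC-G'}

    # Default droplevels=True: 'dataset' and 'filter' each have a single
    # entry in the result, so those levels are dropped and the returned
    # frame has plain column names such as 'coord_ra'.
    df = parq.toDataFrame(columns=columnDict)

    # droplevels=False keeps the full three-level column MultiIndex.
    df_full = parq.toDataFrame(columns=columnDict, droplevels=False)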
 
 
- write(filename)¶

  Write pandas dataframe to parquet.

  Parameters:

  - filename : str
    Path to which to write.
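For example, a write/read round trip might look like the following sketch (df and the output path are placeholders; df is assumed to carry a multi-level column index as in the examples above):

    # Wrap an in-memory dataframe, write it, and re-open the file.
    parq = MultilevelParquetTable(dataFrame=df)
    parq.write("multilevel.parq")
    restored = MultilevelParquetTable("multilevel.parq")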