Packaging data as a dataset
The dataset framework can represent data from any observatory that has an observatory interface (obs) package in the LSST software stack. This page describes how to create and maintain a dataset.
Creating a dataset repository
Datasets are Git LFS repositories with a particular directory and file structure. The easiest way to create a new dataset is to create an LFS repository, and add a copy of the dataset template repository as the initial commit. This will create empty directories for all data and will add placeholder files for dataset metadata.
Organizing the data
- The raw directory contains uningested science data. The directory may have any internal structure.
- The preloaded directory contains a Gen 3 LSST Butler repository with calibration data, coadded difference imaging templates, refcats, and any other files needed for processing science data. It must not contain science data, which belongs only in raw.
- The config/export.yaml file is a relative-path export of the repository at preloaded, used to set up a separate repository for running ap_verify.
- The config and pipelines directories contain configuration overrides needed to run the AP pipeline on the data.
The templates and reference catalogs need not be all-sky, but should cover the combined footprint of all the raw images.
Datasets should contain a scripts directory with scripts for (re)generating and maintaining the contents of the dataset.
This allows the dataset, particularly calibs and templates, to be updated with pipeline improvements.
The scripts directory is not formally part of the dataset framework, and its exact contents are up to the maintainer.
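As a rough illustration, the layout described in this section could be checked with a short script like the one below. This is only a sketch and not part of the dataset framework: the helper name check_dataset_layout is hypothetical, and the check only tests that the paths named above exist, not that their contents are valid.

```python
from pathlib import Path

# Paths this page says a dataset contains, relative to the dataset root.
REQUIRED_DIRS = ("raw", "preloaded", "config", "pipelines")
REQUIRED_FILES = ("config/export.yaml",)
# Recommended, but not formally part of the dataset framework.
RECOMMENDED_DIRS = ("scripts",)


def check_dataset_layout(root):
    """Return (missing, not_present_but_recommended) lists of relative paths.

    Hypothetical helper: it checks for the existence of paths only.
    """
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    recommended = [d for d in RECOMMENDED_DIRS if not (root / d).is_dir()]
    return missing, recommended
```

Calling check_dataset_layout on a dataset's root directory returns two lists; an empty first list means every directory and file named in this section is present.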
Documenting datasets
Datasets provide package-level documentation in their doc directory.
An example is provided in the dataset template repository.
The dataset’s package-level documentation should include:
- the source of the data (e.g., a particular survey with specific cuts applied)
- whether or not optional files such as image differencing templates are provided
- the expected use of the data
Configuring dataset use
The files in config or pipelines should override any config fields that are constrained by the input data, such as template type (deep, goodSeeing, etc.) or refcat filters, even if the current defaults match.
This policy makes the datasets more self-contained and prevents them from breaking when the pipeline defaults change but only one value is valid (e.g., coaddName must be "deep" for a dataset with deep coadds).
Each pipelines directory should contain pipeline files corresponding to the pipelines in the ap_verify/pipelines directory (at the time of writing, ApPipe.yaml, ApVerify.yaml, and ApVerifyWithFakes.yaml).
The default execution of ap_verify assumes these files exist for each dataset, though the --pipeline argument can override this default.
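A maintainer might want to confirm that a pipelines directory has the expected files before running ap_verify with its defaults. The sketch below is one way to do that; the function name missing_pipelines is hypothetical, and the file list is copied from the text above, so it may be out of date relative to the current contents of ap_verify/pipelines.

```python
from pathlib import Path

# File names listed on this page "at the time of writing"; the authoritative
# list is whatever the ap_verify/pipelines directory currently contains.
EXPECTED_PIPELINES = ("ApPipe.yaml", "ApVerify.yaml", "ApVerifyWithFakes.yaml")


def missing_pipelines(pipelines_dir, expected=EXPECTED_PIPELINES):
    """Return the expected pipeline file names that are absent (hypothetical helper)."""
    pipelines_dir = Path(pipelines_dir)
    return [name for name in expected if not (pipelines_dir / name).is_file()]
```

An empty list means every expected pipeline file is present. The check says nothing about whether the files' contents are valid pipelines.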
Configuration settings specific to an instrument rather than a dataset should be handled with ordinary configuration override files.
Registering an observatory package
To ensure dataset processing does not crash, ups/<package>.table must contain a line reading setupRequired(<obs-package>).
For example, for DECam data this would read setupRequired(obs_decam).
If any other unusual packages are required to process the data, they should have their own setupRequired lines.
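To check which packages a table file declares, a small script can scan it for setupRequired lines. The sketch below is illustrative only: the function name declared_dependencies is hypothetical, and the regular expression assumes a simplified form of the table syntax (one setupRequired(...) call per line, with the package name as the first token inside the parentheses).

```python
import re
from pathlib import Path

# Simplified assumption about table-file syntax: a line that starts with
# setupRequired( and whose first token inside the parentheses is the package.
_SETUP_REQUIRED = re.compile(r"^\s*setupRequired\(\s*([^)\s]+)[^)]*\)", re.MULTILINE)


def declared_dependencies(table_path):
    """Return package names from setupRequired(...) lines (hypothetical helper)."""
    text = Path(table_path).read_text()
    return _SETUP_REQUIRED.findall(text)
```

For a DECam dataset, the returned list would be expected to include obs_decam. Lines that begin with another character, such as a leading #, are not matched.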