Packaging data as a dataset¶
The dataset framework is designed to be as generic as possible, and should be able to accommodate any collection of observations so long as the source observatory has an observatory interface (obs) package in the LSST software stack. This page describes how to create and maintain a dataset.
Creating a dataset repository¶
Datasets are Git LFS repositories with a particular directory and file structure. The easiest way to create a new dataset is to create an LFS repository, and add a copy of the dataset template repository as the initial commit. This will create empty directories for all data and will add placeholder files for dataset metadata.
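A minimal sketch of that workflow, assuming the template repository has been published and an empty LFS repository has already been created for the new dataset (`$TEMPLATE_URL`, `$DATASET_URL`, and the dataset name `ap_verify_mydata` are placeholders):

```sh
# Start from a copy of the dataset template; $TEMPLATE_URL and $DATASET_URL
# are placeholders for the template repository and the new LFS repository.
git clone "$TEMPLATE_URL" ap_verify_mydata
cd ap_verify_mydata
git lfs install --local                  # activate Git LFS hooks in this clone
git remote set-url origin "$DATASET_URL"
git push origin HEAD                     # the template becomes the initial commit
```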
Organizing the data¶
- The `raw` and `calib` directories contain science and calibration data, respectively. The directories may have any internal structure.
- The `templates` directory contains a Gen 2 LSST Butler repository of processed images usable as templates. Template files must be `TemplateCoadd` files produced by a compatible version of the LSST Science Pipelines.
- The `refcats` directory contains one or more tar files, each containing one astrometric or photometric reference catalog in HTM shard format.
- The `preloaded` directory contains a Gen 3 LSST Butler repository with calibration data, templates, refcats, and any other files needed for processing science data. If the dataset supports both Gen 2 and Gen 3 processing, the contents of `preloaded` should be equivalent to those of `calib`, `templates`, and `refcats` (and may link to them to save space). It must not contain science data, which belongs only in `raw`.
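In outline, a dataset’s data directories therefore look like the following sketch (internal layouts vary by dataset):

```
<dataset>/
├── raw/         # science images, any internal structure
├── calib/       # calibration data, any internal structure
├── templates/   # Gen 2 Butler repository of TemplateCoadd images
├── refcats/     # tar files of HTM-sharded reference catalogs
└── preloaded/   # Gen 3 Butler repository: calibs, templates, refcats
```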
The templates and reference catalogs need not be all-sky, but should cover the combined footprint of all the raw images.
Datasets should contain a `scripts` directory with scripts for (re)generating and maintaining the contents of the dataset.
This allows the dataset, particularly the calibs and templates, to be updated with pipeline improvements.
The `scripts` directory is not formally part of the dataset framework, and its exact contents are up to the maintainer.
Documenting datasets¶
Datasets provide package-level documentation in their `doc` directory.
An example is provided in the dataset template repository.
The dataset’s package-level documentation should include:
- the source of the data (e.g., a particular survey with specific cuts applied)
- whether or not optional files such as image differencing templates are provided
- the expected use of the data
Configuring dataset ingestion and use¶
Each dataset’s `config` directory should contain a task config file named `datasetIngest.py`, which specifies a `DatasetIngestConfig`.
The file typically contains filenames or file patterns specific to the dataset.
In particular, the default config ignores reference catalogs, so the config file should provide a `dict` mapping catalog names to their tar files.
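For illustration, a minimal `datasetIngest.py` might look like the sketch below; the catalog names and tar file names are assumptions for this example, and the field name `refcats` should be checked against the `DatasetIngestConfig` in your stack version.

```python
# config/datasetIngest.py -- a minimal sketch; "gaia", "panstarrs", and the
# tar file names are illustrative, and the field name `refcats` is assumed.
# Map each reference catalog name to its tar file in the refcats directory.
config.refcats = {
    "gaia": "gaia.tar.gz",
    "panstarrs": "ps1.tar.gz",
}
```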
Each `config` directory may contain a task config file named `apPipe.py`, specifying an `lsst.ap.pipe.ApPipeConfig`.
The file contains pipeline flags specific to the dataset, such as the available reference catalogs (both their names and configuration) or the type of template provided to `ImageDifferenceTask`.
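A sketch of such overrides follows; the subtask names and attribute paths are hypothetical and must be checked against the `ApPipeConfig` in your version of `ap_pipe`.

```python
# config/apPipe.py -- illustrative overrides only; the subtask names
# (ccdProcessor, differencer) and attribute paths below are assumptions.
# Point the astrometric and photometric fitters at the dataset's catalogs.
config.ccdProcessor.calibrate.astromRefObjLoader.ref_dataset_name = "gaia"
config.ccdProcessor.calibrate.photoRefObjLoader.ref_dataset_name = "panstarrs"
# Match the template type to what the templates directory provides.
config.differencer.coaddName = "deep"
```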
Each `pipelines` directory should contain pipeline files corresponding to the pipelines in the `ap_verify/pipelines` directory (at the time of writing, `ApPipe.yaml`, `ApVerify.yaml`, and `ApVerifyWithFakes.yaml`).
These files should incorporate the same dataset-specific configuration overrides as described above for `apPipe.py`.
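For example, a dataset’s `ApPipe.yaml` might import the standard pipeline and layer the same overrides on top; the import location, task label, and override values below are illustrative assumptions, not prescribed by the framework.

```yaml
# pipelines/ApPipe.yaml -- a hedged sketch; adjust labels and values to match
# the pipelines actually shipped with your version of ap_verify.
description: ApPipe specialized for this dataset
imports:
  - location: $AP_VERIFY_DIR/pipelines/ApPipe.yaml
tasks:
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      connections.astromRefCat: gaia
      connections.photoRefCat: panstarrs
```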
Configuration settings specific to an instrument rather than a dataset should be handled with ordinary configuration override files.
Registering an observatory package¶
The observatory package must be named in two files:
- `ups/<package>.table` must contain a line reading `setupRequired(<obs-package>)`. For example, for DECam data this would read `setupRequired(obs_decam)`. If any other packages are required to process the data, they should have their own `setupRequired` lines.
- `repo/_mapper` must contain a single line with the name of the obs package’s mapper class. For DECam data this is `lsst.obs.decam.DecamMapper`.
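Putting the DECam example together for a hypothetical dataset package, `ups/<package>.table` would contain

```
setupRequired(obs_decam)
```

and `repo/_mapper` would contain the single line

```
lsst.obs.decam.DecamMapper
```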