Getting started tutorial part 2: calibrating single frames with Pipeline Tasks¶
In this part of the tutorial series you’ll process individual raw HSC images in the Butler repository (which you assembled in part 1) into calibrated exposures. We’ll use the singleFrame pipeline to remove instrumental signatures with dark, bias and flat field calibration images. This will also use the reference catalog to establish a preliminary WCS and photometric zeropoint solution.
Pipelines are defined in
The example git repository contains a pipeline definition for data release processing that is only slightly modified from the one used in production processing of HSC data.
If you are interested in examining the pipeline defintions provided for this tutorial, look in
Pick up your shell session where you left off in part 1. For convenience, start in the top directory of the example git repository:
lsst_distrib package also needs to be set up in your shell environment.
See Setting up installed LSST Science Pipelines for details on doing this.
Reviewing what data will be processed¶
Pipeline tasks can operate on a single image or iterate over multiple images. You can query for the data IDs of the exposures available to process using the butler command line utility.
butler query-data-ids SMALL_HSC exposure,detector --collections HSC/RC2/defaults --datasets 'raw'
The first few lines look something like:
band instrument detector physical_filter exposure ---- ---------- -------- --------------- -------- y HSC 41 HSC-Y 322 y HSC 42 HSC-Y 322 y HSC 47 HSC-Y 322 y HSC 49 HSC-Y 322
Notice the keys that describe each data ID, such as the
exposure (exposure identifier for the HSC camera),
detector (identifies a specific chip in the HSC camera),
band (a generic specifier for about what part of the EM spectrum the filter samples) and
physical_filter (the identifier of the specific implementation of this band for the HSC camera).
With these keys you can select exactly what data you want to process.
It’s worth noting that the keys shown here are not the minimal ones needed to specify, for example, a
Only instrument+detector+exposure are necessary to uniquely identify a specific
raw dataset, because the system knows that an exposure implies a physical_filter and a physical_filter implies a band.
raw dataset is special in the sense that it is the starting point for all processing.
Images in the
raw dataset have not had any processing applied to them and are as they arrive from the data aquisition system.
The important arguments are
--collections argument allows you to select which data to search for data IDs.
You can see available collections by, again, using the butler utility.
butler query-collections SMALL_HSC
--datasets argument allows you to specify what type of data to query for data IDs.
To ask the repository which values are available to pass, you can say:
butler query-dataset-types SMALL_HSC
You can also filter the datasets you get back using the
For example, here’s how to select just
butler query-data-ids SMALL_HSC exposure,detector --collections HSC/RC2/defaults --datasets 'raw' --where "physical_filter='HSC-I' AND instrument='HSC'"
Now only data IDs for
HSC-I datasets are printed.
For instrument specific things like the filter, the instrument must be specified.
The instruments registered with a particular repository can be retrieved using the
query-dimension-records subcommand of
butler query-dimension-records SMALL_HSC/ instrument
There is only one instrument in this repository, so you only see metadata about that one instrument. The result of the above command should look like this:
name visit_max exposure_max detector_max class_name ---- --------- ------------ ------------ ------------------------------- HSC 21474800 21474800 200 lsst.obs.subaru.HyperSuprimeCam
For more information about the
butler command line tool, try
Running single frame processing¶
As mentioned in part 1, this part of the processing is by far the most time consuming.
If you do not wish to process all the data in the repository at this time, you can specify a data query that will reduce the number of exposures to be processed.
Simply add the argument
-d "instrument='HSC' AND detector=41 AND exposure=322" to the command line below, and change
Note that this will give you an idea of how to execute processing steps from the command line, but this subset of data will not be sufficient for the full tutorial.
If you wish to follow the entire tutorial, you will need to use the full
Data queries will be discussed in more detail later.
After learning about datasets, go ahead and run single frame processing using the pipetask command on all
raw datasets in the repository:
pipetask run -b $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml \ -p $RC2_SUBSET_DIR/pipelines/DRP.yaml#singleFrame \ -i HSC/RC2/defaults \ -o u/$USER/single_frame \ --register-dataset-types
There are many arguments to command:
You can get useful information by saying command:
pipetask run --help, but let’s go over the ones listed here.
-b option specifies which butler definition to use when constructing the
Butler object to use in processing.
-p option specifies which pipeline to run.
The full pipeline definition lives in the
DRP.yaml file, but subtasks of the full processing can be run by specifying the subtask name with the
# character, e.g.
#singleFrame in this case.
-i option indicates the input collections to use in processing.
You will learn more about collections later in this document.
-o option defines the output collection to send the results of the processing to.
These tutorials suggest that you put the outputs in collections under a namespace defined by your username since that is unique for a given system.
In this case, there is little reason to be so careful because you are likely to have cloned into a space not shared with others.
However, it is good practice for times when you may be using a repository with a registry used by other users on the same system.
--register-dataset-types switch tells the butler to register a dataset type if it doesn’t already have a definition for it.
Because pipelines are allowed to define datasets at runtime, this switch is necessary if you expect products to be produced that are not already represented in the registry as in this case where we are producing calibrated exposures in a repository that contains only
If you expect that all of the dataset types should already be registiered, as is the case when processing another subset of data with a pipeline that has already been run, it can help catch unexpected behavior to remove that switch.
It is not included in the above command, but the
-j option is useful if you have more than one core available to you.
-j<num cores> will run in parallel where
<num cores> is the number of processes to execute in parallel.
Dataset queries can be specified using the
-d argument to specify which specific datasets should be considered when building the execution graph.
If this argument is omitted, all data in the repository that can be processed based on other inputs, e.g. calibrations, will be.
Aside: collections and quantum graphs¶
Collections are the primary way data in butler repositories are organized.
Of the types of collections available, the two of interest here are the
RUN collections are the least flexible.
Once a dataset is added to a
RUN collection, it can never be moved to a different
The constraints on datasets in
RUN collections makes these collections that most efficient to store and query.
The collection containing the raw data is a
CHAINED collections are groupings of other collections associated with an alias for that grouping.
The grouping of collections defines the order of collections to search when looking for a dataset associated with a specific data ID.
The collection produced from the
-o option above is a
The output collection will, in general, include all the collections in the input plus any
RUN collections produced by the processing.
The first step of process data is to produce the quantum graph for the processing. This is a directed acyclic graph that completely defines inputs and outputs for every node (quantum) in the graph.
Quantum graphs can be saved for reuse later, and diagnostic
graphviz files can be used to visualize the quantum graph.
qgraph subcommand to
pipetask can be used to generate the quantum graph without doing any further processing.
The full processing produces a quantum graph that has many nodes and is hard to look at on one page.
There is a simplified version of the pipeline that is not sufficient for other pipelines, but that does produce a simple enough quantum graph to easily be viewed on one page.
It is called
Try building the quantum graph for the processing of a single detector:
pipetask qgraph -b $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml \ -p $RC2_SUBSET_DIR/pipelines/DRP.yaml#simpleSingleFrame \ -i HSC/RC2/defaults \ -o u/$USER/single_frame \ -d "instrument='HSC' AND detector=41 AND exposure=322" \ --qgraph-dot single_frame.dot \ --save-qgraph single_frame.qgraph
The quantum graph is saved in
pickle format in the file called
graphviz file is in
If you have
graphviz installed, you can turn the
dot file into something you look at via a command like this:
dot -Tpdf -osingle_frame.pdf single_frame.dot
This should produce something similar to the following figure.
In this tutorial, you’ve used the pipetask run command to calibrate
raw images in a Butler repository.
Here are some key takeaways:
The pipetask run command, with appropriate arguments and switches, processes
rawdatasets, applying both photometric and astrometric calibrations.
Datasets are described by both a type and data ID. Data IDs are key-value pairs that describe a dataset (for example
Dataset queries can be used to specify which datasets to process..
Pipelines write their outputs to a Butler data repository. Collections are used to organize and associate outputs of processing with the inputs to the processing.
Continue this tutorial in part 3, where you’ll learn how to display these calibrated exposures.