Getting started tutorial part 2: calibrating single frames with processCcd.py

In this part of the tutorial series you’ll process individual raw HSC images in the Butler repository (which you assembled in part 1) into calibrated exposures. We’ll use the processCcd.py command-line task to remove instrumental signatures with dark, bias and flat field calibration images. processCcd.py will also use the reference catalog to establish a preliminary WCS and photometric zeropoint solution.

Set up

Pick up your shell session where you left off in part 1. That means your current working directory must contain the DATA directory (the Butler repository).

The lsst_distrib package also needs to be set up in your shell environment. See Setting up installed LSST Science Pipelines for details on doing this.

Reviewing what data will be processed

processCcd.py can operate on a single image or iterate over multiple images. You can do a dry run to see which data in the Butler repository will be processed:

processCcd.py DATA --rerun processCcdOutputs --id --show data

The important arguments here are --id and --show data.

The --id argument allows you to select datasets to process by their data IDs. Data IDs describe individual datasets in the Butler repository. Datasets also have types, and each command-line task will only process data of certain types. In this case, processCcd.py processes raw exposures (uncalibrated images from individual CCD chips).

In the above command, the plain --id argument acts as a wildcard that selects all raw-type data in the repository (in a moment we’ll see how to filter data IDs).

The --show data argument puts processCcd.py into a dry-run mode: instead of processing the data, it prints to standard output the data IDs that the --id argument selects. For example, one line of the output from a processCcd.py run with --show data looks like:

id dataRef.dataId = {'taiObs': '2013-06-17', 'pointing': 533, 'visit': 903334, 'dateObs': '2013-06-17', 'filter': 'HSC-R', 'field': 'STRIPE82L', 'ccd': 23, 'expTime': 30.0}

Notice the keys that describe each data ID, such as the visit (exposure identifier for the HSC camera), ccd (identifies a specific chip in the HSC camera) and filter, among others. With these keys you can select exactly what data you want to process. For example, here’s how to select just HSC-I-band datasets:

processCcd.py DATA --rerun processCcdOutputs --id filter=HSC-I --show data

Now only data IDs for HSC-I datasets are printed. The --id argument supports a rich syntax for expressing data IDs by multiple selection criteria.
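
To make the idea of selecting by data ID keys concrete, here is a minimal Python sketch that parses lines in the format printed by --show data and keeps the ones whose keys match given values. It only illustrates the key-value matching idea; it is not how the Butler or processCcd.py implements selection. The helper names parse_data_id and matches are hypothetical, and the second sample line uses made-up values for illustration.

```python
import ast


def parse_data_id(line):
    """Return the data ID dict from one line of `--show data` output."""
    # The text after the first "=" is a Python dict literal.
    _, _, literal = line.partition("=")
    return ast.literal_eval(literal.strip())


def matches(data_id, **criteria):
    """True if every key=value criterion equals the data ID's value."""
    return all(data_id.get(key) == value for key, value in criteria.items())


lines = [
    # Line copied from the example output above.
    "id dataRef.dataId = {'taiObs': '2013-06-17', 'pointing': 533, "
    "'visit': 903334, 'dateObs': '2013-06-17', 'filter': 'HSC-R', "
    "'field': 'STRIPE82L', 'ccd': 23, 'expTime': 30.0}",
    # Hypothetical line with made-up values, for illustration only.
    "id dataRef.dataId = {'taiObs': '2013-11-02', 'pointing': 671, "
    "'visit': 100000, 'dateObs': '2013-11-02', 'filter': 'HSC-I', "
    "'field': 'STRIPE82L', 'ccd': 16, 'expTime': 30.0}",
]

data_ids = [parse_data_id(line) for line in lines]

# Comparable in spirit to passing `--id filter=HSC-I` on the command line.
selected = [d for d in data_ids if matches(d, filter="HSC-I")]
for d in selected:
    print(d["visit"], d["ccd"], d["filter"])
```

Passing more than one criterion, for example matches(d, filter="HSC-I", ccd=16), requires all of them to hold at once.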

Running processCcd.py

Now that you’ve seen how datasets are selected, go ahead and run processCcd.py on all raw datasets in the repository:

processCcd.py DATA --rerun processCcdOutputs --id

Aside: reruns and output Butler repositories

While processCcd.py runs, let’s discuss the --rerun argument. Command-line tasks, like processCcd.py, write their output datasets to Butler data repositories. There are two ways to specify an output data repository: with the --output argument, or with the --rerun command-line argument.

The rerun pattern is convenient, especially with local Butler repositories, because each rerun is packaged within the file system directory of the parent Butler data repository (the DATA directory in this tutorial). Above, when you ran processCcd.py, you configured it to write outputs to a new rerun named processCcdOutputs.

The idea is that you’ll process data by running a sequence of individual command-line tasks. At each stage, you will output datasets to a new rerun. This is called rerun chaining, and you’ll learn how to do this in the next tutorial.

If you need to re-do a processing step, to experiment with a different command-line task configuration for example, you can do that safely by outputting to a new rerun.
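
As a rough sketch of what “outputting to a new rerun” can look like in practice, the Python helper below picks a rerun name that is not already in use. It assumes that a rerun named X is stored at DATA/rerun/X inside the parent repository; that layout is an assumption, not something this tutorial states. The helper names rerun_dir and fresh_rerun_name are hypothetical and are not part of the LSST Science Pipelines.

```python
from pathlib import Path


def rerun_dir(repo, rerun):
    """Directory where a rerun is assumed to live: <repo>/rerun/<rerun>."""
    return Path(repo) / "rerun" / rerun


def fresh_rerun_name(repo, base):
    """Return `base` if no such rerun exists, else base_2, base_3, ..."""
    if not rerun_dir(repo, base).exists():
        return base
    n = 2
    while rerun_dir(repo, f"{base}_{n}").exists():
        n += 1
    return f"{base}_{n}"


# Suggest a rerun name that will not mix with earlier outputs.
name = fresh_rerun_name("DATA", "processCcdOutputs")
print(f"processCcd.py DATA --rerun {name} --id")
```

The sketch only prints a suggested command line; it does not run processCcd.py or create any directories.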

Important

Bottom line: a given rerun must contain data that was all processed consistently, with the same task configurations. If you mix outputs from multiple runs of a command-line task with different configurations, it may be impossible to understand or use the results of the data processing.

Wrap up

In this tutorial, you’ve used the processCcd.py command-line task to calibrate raw images in a Butler repository. Here are some key takeaways:

  • The processCcd.py command-line task processes raw datasets, applying both photometric and astrometric calibrations.
  • Datasets are described by both a type and data ID. Data IDs are key-value pairs that describe a dataset (for example filter, visit, ccd, field).
  • Command-line tasks have --id arguments that let you select which datasets to process. An empty --id argument acts as a wildcard that selects all available datasets in the repository of the type the command-line task can process.
  • Command-line tasks write their outputs to a Butler data repository. Reruns (--rerun argument) are a convenient way to create output data repositories. Make sure that all datasets in a rerun are processed consistently.

Continue this tutorial in part 3, where you’ll learn how to display these calibrated exposures.