#########################################
Processing DC2 data with the AP Pipelines
#########################################

A walkthrough for running the Alert Production (AP) pipeline on an example set of image data.
The data used in this guide are simulated, and are the same kind used for Rubin's Data Preview 0 (DP0).

.. note::

   This guide assumes the user has access to a shared Butler repository containing data from the Dark Energy Science Collaboration (DESC)'s Data Challenge 2 (DC2) via the `US Data Facility (USDF) <https://developer.lsst.io/usdf/storage.html>`__.
   This guide further assumes the user has a recently built version of ``lsst_distrib`` from the `LSST Science Pipelines <https://developer.lsst.io/usdf/stack.html>`__ (circa ``w_2022_30`` or later).

What does existing DC2 data look like?
======================================

* The instrument is called ``LSSTCam-imSim``
* The obs-package is ``obs_lsst``, i.e., :ref:`lsst.obs.lsst` (note ``imsim`` subdirectories within)
* The skymap is called ``DC2``
* Patches go from 0 to 48
* Detectors go from 0 to 188
* Available bands are ``ugrizy``
* Data exist for some tracts in the ~2553–5074 range, and commonly reprocessed tracts include 3828, 3829, and 4431
* One set of reference catalogs is called ``cal_ref_cat_2_2``, and has some filtermap definitions that must be specified

The Shared DC2 Butler Repository at the USDF
--------------------------------------------

* The Butler repository is located at ``/sdf/group/rubin/repo/dc2``, which the Butler also recognizes via the alias ``/repo/dc2``
* All raw data are available in the collection ``2.2i/raw/all``
* All raw data with ancillary processing inputs (e.g., calibs, skymaps, refcats) are available in the collection ``2.2i/defaults``; this collection contains ``2.2i/raw/all`` and several other collections
* Only tracts 3828 and 3829 are in the collection ``2.2i/defaults/test-med-1``
* A smaller CI dataset, just patch 24 in tract 3828, is in ``2.2i/defaults/ci_imsim`` and corresponds to the GitHub repo ``testdata_ci_imsim``
* Tract 4431 has a 10-year simulated depth, and was processed in the collection ``u/kherner/2.2i/runs/tract4431-w40``
* Visits that fully overlap four adjacent patches (9, 10, 16, and 17) in tract 4431 are in the collection ``u/mrawls/DM-34827/defaults/4patch_4431`` (this guide will use this collection!)

Using the Butler to explore collections and datasets
----------------------------------------------------

To explore these and other collections, try, e.g.,

.. prompt:: bash

   butler query-collections /repo/dc2 "2.2i/defaults"
   butler query-collections /repo/dc2 "u/mrawls/DM-34827*"

These commands should print a list of collections that meet the search criteria.

To see some available datasets for processing, try, e.g.,

.. prompt:: bash

   butler query-data-ids /repo/dc2 tract patch visit --collections='u/mrawls/DM-34827/defaults/4patch_4431' --where "skymap='DC2' AND band='g' AND instrument='LSSTCam-imSim'" --datasets "raw"

This command should print a list of data IDs that meet the search criteria, along with their tract, patch, and visit number.
Certain constraints are required in the ``--where`` expression, including ``skymap`` and ``instrument``, while most others are optional, and may include ``band``, ``tract``, ``patch``, etc.

Processing Data with the AP Pipelines
=====================================

Now it's time to process some data.
In this guide, we will run a template-building pipeline, ``ApTemplate.yaml``, first.
This pipeline starts with raw images and runs standard single frame processing (which includes :py:class:`lsst.ip.isr.isrTask.IsrTask`, :py:class:`lsst.pipe.tasks.characterizeImage.CharacterizeImageTask`, and :py:class:`lsst.pipe.tasks.calibrate.CalibrateTask`).
From here, it is possible to run :py:class:`lsst.pipe.tasks.postprocess.ConsolidateVisitSummaryTask`, :py:class:`lsst.pipe.tasks.makeCoaddTempExp.MakeWarpTask`, :py:class:`lsst.pipe.tasks.selectImages.BestSeeingQuantileSelectVisitsTask`, and :py:class:`lsst.pipe.tasks.assembleCoadd.CompareWarpAssembleCoaddTask`.
The final result is a set of good seeing coadd templates.

In a second pipeline, ``ApPipe.yaml``, we will run difference imaging using the templates we just built.
This pipeline also starts with single frame processing on raw images, followed by :py:class:`lsst.ip.diffim.subtractImages.AlardLuptonSubtractTask`, :py:class:`lsst.ip.diffim.detectAndMeasure.DetectAndMeasureTask`, :py:class:`lsst.ap.association.transformDiaSourceCatalog.TransformDiaSourceCatalogTask`, and :py:class:`lsst.ap.association.diaPipe.DiaPipelineTask`.
The final results include difference images, some output catalogs, and an Alert Production Database (APDB).

Building good seeing templates
------------------------------

The pipeline we will use lives in the ``ap_pipe`` package, and is the camera-specific ``ApTemplate.yaml`` pipeline.
To see it, either navigate to the `pipeline on GitHub <https://github.com/lsst/ap_pipe/blob/main/pipelines/LsstCamImSim/ApTemplate.yaml>`__ or display the pipeline via the command line, e.g.,

.. prompt:: bash

   cat $AP_PIPE_DIR/pipelines/LsstCamImSim/ApTemplate.yaml

Note that this camera-specific pipeline imports both a camera-specific single-frame processing pipeline (sometimes called "processCcd") and a more generic AP Template building pipeline.

To visualize this pipeline, you may wish to use ``pipetask build``, e.g.,

.. prompt:: bash

   pipetask build -p $AP_PIPE_DIR/pipelines/LsstCamImSim/ApTemplate.yaml --pipeline-dot ApTemplate.dot
   dot ApTemplate.dot -Tpng > ApTemplate.png

Alternately, navigate to `this website that serves visualizations of all the AP and DRP pipelines <https://tigress-web.princeton.edu/~lkelvin/pipelines/current>`__.
Click through to ``ap_pipe``, then ``LsstCamImSim``, and finally ``ApTemplate`` to find a PDF visualizing all the pipeline inputs, outputs, and intermediate data products.
This PDF is auto-generated each week using the same ``pipetask build`` command as shown above.

To run this pipeline, choose an appropriate output collection name (``u/USERNAME/OUTPUT-COLLECTION-1`` in the example below), and run

.. prompt:: bash

   pipetask run -j 4 -b /repo/dc2 -d "skymap='DC2' AND tract=4431 AND patch IN (9, 10, 16, 17) AND band='g'" -i 2.2i/defaults -o u/USERNAME/OUTPUT-COLLECTION-1 -p $AP_PIPE_DIR/pipelines/LsstCamImSim/ApTemplate.yaml

To run the process in the background and write output to a logfile, you may wish to prepend ``nohup`` to ``pipetask run`` and append ``> OUTFILENAME &`` to the command.

This will take some time, but when it's done, you should have calibrated exposures, a visit summary table, warps, and assembled good seeing coadds for use as templates.
We are now ready to run the rest of the AP Pipeline (namely difference imaging and source association).

Performing difference imaging and making an APDB
------------------------------------------------

This next step uses a second pipeline, which begins once again with single frame processing.
If you choose to reuse some or all of the same input raw exposures, all previously run steps will automatically be skipped and pre-existing outputs used.
Afterwards, it performs difference imaging and saves the results in an Alert Production Database (APDB).

The pipeline we will use also lives in the ``ap_pipe`` package, and is the camera-specific ``ApPipe.yaml`` pipeline.
To see it, either navigate to the `pipeline on GitHub <https://github.com/lsst/ap_pipe/blob/main/pipelines/LsstCamImSim/ApPipe.yaml>`__ or display the pipeline via the command line, e.g.,

.. prompt:: bash

   cat $AP_PIPE_DIR/pipelines/LsstCamImSim/ApPipe.yaml

This difference imaging pipeline requires coadds as inputs for use as templates, and treats all input raws as "science" images.

Unlike before, however, we need to create an empty APDB for the final step of the pipeline to connect and write to.
The simplest option, which works fine for relatively small processing runs, is to create an empty sqlite database in your working directory.
Larger runs will require using, e.g., PostgreSQL, which is beyond the scope of this guide.

To create an empty sqlite APDB:

.. prompt:: bash

   make_apdb.py -c db_url="PATH-TO-YOUR-APDB-HERE"

**The APDB must exist and be empty before you run the AP Pipeline.**
It is highly recommended to make a new APDB each time the AP Pipeline is rerun for any reason.
A typical ``db_url`` is, e.g., ``sqlite:////path/to/my-working-directory/run1.db``.
The configs you set when making the APDB must match those you give the AP Pipeline at runtime.

As before, to visualize the AP Pipeline, you may navigate to `the website with visualizations of all the AP and DRP pipelines <https://tigress-web.princeton.edu/~lkelvin/pipelines/current>`__.
Click through to ``ap_pipe``, then ``LsstCamImSim``, and finally ``ApPipe`` to find a PDF visualizing all the pipeline inputs, outputs, and intermediate data products.
This PDF is auto-generated each week using a ``pipetask build`` command analogous to the one shown above for ``ApTemplate.yaml``.

You are now ready to run the AP Pipeline!
You will need to substitute appropriate values for your input collections, your desired new output collection, and your APDB URL in order to run

.. prompt:: bash

   pipetask run -j 4 -b /repo/dc2 -d "skymap='DC2' AND band='g'" -i u/USERNAME/OUTPUT-COLLECTION-1,u/mrawls/DM-34827/defaults/4patch_4431 -o u/USERNAME/OUTPUT-COLLECTION-2 -p $AP_PIPE_DIR/pipelines/LsstCamImSim/ApPipe.yaml -c diaPipe:apdb.db_url="PATH-TO-YOUR-APDB-HERE"

What are the output data products?
==================================

When the AP Pipeline completes, you will have difference images, difference image source tables, and an APDB with populated tables (``DiaSource``, ``DiaObject``, etc.) for ``g`` band visits that fully overlap four patches of tract 4431.

A few analysis and plotting tools exist to explore the APDB and other AP Pipeline outputs.
These live in `analysis_ap <https://github.com/lsst/analysis_ap>`__.

One output of the AP Pipeline is a set of DIA (Difference Image Analysis) Source Tables, which the Butler can retrieve via the dataset type ``goodSeeingDiff_diaSrcTable``.
To see what DIA Source Tables exist, query, e.g.,

.. prompt:: bash

   butler query-data-ids /repo/dc2 visit detector --collections="u/USERNAME/OUTPUT-COLLECTION-2" --where "skymap='DC2' AND band='g' AND instrument='LSSTCam-imSim'" --datasets "goodSeeingDiff_diaSrcTable"

The APDB also contains several tables with information about DIA Sources, DIA Objects, and Solar System Objects.
Objects represent real astrophysical things, and are created by spatially associating per-visit Sources.
The DIA prefix indicates we are talking about Sources and Objects in difference images.
More information about the APDB schema is available in `sdm_schemas <https://github.com/lsst/sdm_schemas/blob/main/yml/apdb.yaml?>`__.

.. note::

   None of the following is a formally supported APDB user interface.
   It is one way to load a table from the APDB into memory in Python and make a quick plot to see where the associated DIA Objects fall on the sky.
   It also includes an example of how to load a ``goodSeeingDiff_diaSrcTable`` with the Butler for further analysis.
   Future plans include support for visualizing some AP Pipeline outputs via :ref:`lsst.analysis.tools` and/or :ref:`lsst.analysis.ap`.

Give this a try in a Jupyter notebook:

.. code-block:: python
   :name: apdb-simple-example

   %matplotlib notebook

   import sqlite3

   import pandas as pd
   import matplotlib.pyplot as plt

   import lsst.daf.butler as dafButler

   # Define the data we are exploring, and instantiate a Butler
   repo = '/repo/dc2'
   collections = 'u/USERNAME/OUTPUT-COLLECTION-2'
   instrument = 'LSSTCam-imSim'
   skymap = 'DC2'
   butler = dafButler.Butler(repo, collections=collections,
                             instrument=instrument, skymap=skymap)

   # Load a diaSrcTable from the Butler for one (visit, detector)
   diaSrcTable_example = butler.get('goodSeeingDiff_diaSrcTable',
                                    visit=960220, detector=33)

   # Take a look at it
   diaSrcTable_example.head()

   # Connect to the APDB and load all current DiaObjects from the whole run
   # (a NULL validityEnd selects the latest version of each object)
   connection = sqlite3.connect('/path/to/my-working-directory/run1.db')
   query = ('select "diaObjectId", "ra", "decl", '
            '"nDiaSources", "gPSFluxMean", "validityEnd" '
            'from "DiaObject" where "validityEnd" is NULL;')
   objTable = pd.read_sql_query(query, connection)

   # Take a look at it
   objTable

   # Plot DIA Objects on the sky, with marker size scaled by nDiaSources
   fig = plt.figure(figsize=(6, 6))
   ax = fig.add_subplot(111)
   ax.scatter(objTable.ra, objTable.decl,
              s=objTable.nDiaSources*2, marker='o', alpha=0.4)
   ax.set_xlabel('RA (deg)')
   ax.set_ylabel('Dec (deg)')
   ax.set_title('DIA Objects on the sky')

Processing Data with BPS
========================

The example data processing steps above assume a relatively small data volume, so running from the command line and using an sqlite APDB is appropriate.
However, if you want to process larger data volumes, you'll need to use the Batch Processing System (BPS, :py:mod:`lsst.ctrl.bps`) and a PostgreSQL APDB.

Describing how to set up a PostgreSQL APDB from scratch is beyond the scope of this guide.
One key difference between using an sqlite APDB versus a PostgreSQL APDB is that the former is a file on disk created from scratch when running ``make_apdb.py``.
The latter requires a database to already exist, and ``make_apdb.py`` turns the specified schema (via the ``namespace`` config option) in an existing PostgreSQL database into an empty APDB.
As before, you will still need to run, e.g.,

.. prompt:: bash

   make_apdb.py -c db_url="postgresql://USER@DB_ADDRESS/DB_NAME" -c namespace='DESIRED_POSTGRES_SCHEMA_NAME'

(being sure to replace ``USER``, ``DB_ADDRESS``, ``DB_NAME``, and ``DESIRED_POSTGRES_SCHEMA_NAME`` with appropriate values).

Next, use the documentation for :py:mod:`lsst.ctrl.bps` to `define a submission <https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#defining-a-submission>`__ by creating two BPS configuration files: one for the template-building step and one for the difference-imaging step.
Save these BPS configuration files as ``ApTemplate-DC2-bps.yaml`` and ``ApPipe-DC2-bps.yaml``.

.. note::

   The :py:mod:`lsst.ctrl.bps` module is well-documented, but at the time of this writing, best practices for running BPS at the USDF are still in development.
   Refer to the `USDF documentation pages <https://developer.lsst.io/usdf/batch.html>`__ for the latest recommendations.
   There is likely a set of default configurations, pertaining to the underlying architecture for batch job submissions, that users must import or place directly in their BPS configuration file.

Ensure the ``pipelineYaml`` keyword points to the appropriate ApTemplate and ApPipe pipelines in each BPS configuration file, and that you specify appropriate values for ``butlerConfig``, ``inCollection``, ``outCollection`` (or ``payloadName``, which may be used to construct ``outCollection``), and ``dataQuery``.
These values mirror the ``-b``, ``-i``, ``-o``, and ``-d`` arguments to ``pipetask run``, respectively.
For example, to make good seeing templates using all available patches and bands in two entire tracts, you may wish to use a data query like ``instrument='LSSTCam-imSim' and tract in (3828, 3829) and skymap='DC2'``.
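
As an illustration of how these keywords fit together, a minimal ``ApTemplate-DC2-bps.yaml`` might look like the sketch below.
It follows the layout used in the BPS quickstart and reuses the repository, input collection, and data query from this guide.
The ``payloadName`` value is a hypothetical placeholder, and all site-specific settings (e.g., the workflow management service and compute site) are deliberately omitted; take those from the USDF documentation rather than from this sketch.

.. code-block:: yaml

   # ApTemplate-DC2-bps.yaml (illustrative sketch, not a complete USDF config)

   # Pipeline to run; the analog of "-p" on the pipetask command line
   pipelineYaml: "${AP_PIPE_DIR}/pipelines/LsstCamImSim/ApTemplate.yaml"

   payload:
     # Hypothetical name; may be used to construct the output collection
     payloadName: ApTemplate-DC2
     # Analog of "-b"
     butlerConfig: /repo/dc2
     # Analog of "-i"
     inCollection: 2.2i/defaults
     # Analog of "-d"
     dataQuery: "instrument='LSSTCam-imSim' and tract in (3828, 3829) and skymap='DC2'"

Keeping the data query in the configuration file, rather than on the command line, makes it easy to record exactly which tracts a given set of templates covers.
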
When you are ready to submit your first BPS run to build templates, follow the documentation to `submit a run <https://pipelines.lsst.io/v/weekly/modules/lsst.ctrl.bps/quickstart.html#submitting-a-run>`__, e.g.,

.. prompt:: bash

   bps submit ApTemplate-DC2-bps.yaml

Once the templates are built, the second BPS configuration file will typically need to have two input collections: the output collection from the first run and a collection with raw science images.

As before, you will need to run ``make_apdb.py`` prior to running the second pipeline.
To configure the APDB in a BPS configuration file that runs ``ApPipe.yaml``, add a line like this for a PostgreSQL APDB:

.. code-block:: yaml

   extraQgraphOptions: "-c diaPipe:apdb.db_url='postgresql://USER@DB_ADDRESS/DB_NAME' -c diaPipe:apdb.namespace='DESIRED_POSTGRES_SCHEMA_NAME'"

Finally, to submit the second BPS run, which performs difference imaging and populates the APDB, run, e.g.,

.. prompt:: bash

   bps submit ApPipe-DC2-bps.yaml
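
For reference, the pieces of ``ApPipe-DC2-bps.yaml`` described above might be assembled as in the following sketch.
The ``payloadName`` value and the name of the template output collection are hypothetical placeholders, the raw science collection is the four-patch collection used earlier in this guide, and site-specific batch settings are again omitted.

.. code-block:: yaml

   # ApPipe-DC2-bps.yaml (illustrative sketch, not a complete USDF config)

   pipelineYaml: "${AP_PIPE_DIR}/pipelines/LsstCamImSim/ApPipe.yaml"

   # Passed through to quantum graph generation; must match the values
   # given to make_apdb.py when the empty APDB was created
   extraQgraphOptions: "-c diaPipe:apdb.db_url='postgresql://USER@DB_ADDRESS/DB_NAME' -c diaPipe:apdb.namespace='DESIRED_POSTGRES_SCHEMA_NAME'"

   payload:
     # Hypothetical name; may be used to construct the output collection
     payloadName: ApPipe-DC2
     butlerConfig: /repo/dc2
     # Two inputs: the templates from the first run (placeholder name),
     # then a collection containing the raw science images
     inCollection: u/USERNAME/TEMPLATE-OUTPUT-COLLECTION,u/mrawls/DM-34827/defaults/4patch_4431
     dataQuery: "instrument='LSSTCam-imSim' and skymap='DC2' and band='g'"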