Error handling in ap_verify and failed runs

The Alert Production pipeline may fail for reasons ranging from corrupted data to improperly configured datasets to bugs in the code. The ap_verify framework tries to handle failures gracefully to minimize wasted server time and maximize debugging potential.

Error-handling policy

ap_verify does not attempt to resolve exceptions emitted by pipeline tasks, because it does not have enough information about the pipeline implementation to provide a meaningful resolution. Nor does it ignore errors and press forward (although it does not prevent individual tasks from adopting this approach), since doing so tends to cause cascading failures from an incomplete and possibly corrupted dataset. Terminating on failure lets pipeline problems be detected quickly during testing, rather than after a day or more of processing.

If a task fails with a fatal error, ap_verify.py will clean up and shut down. In particular, where possible it will preserve metrics computed before the failure point.
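The policy above can be illustrated with a minimal sketch. It is not ap_verify's actual implementation: the function name run_until_failure, the convention that each task returns a dict of measurements, and the JSON output file are all assumptions made for illustration.

```python
import json
from pathlib import Path


def run_until_failure(tasks, metrics_path):
    """Run tasks in order and stop at the first fatal error.

    Hypothetical helper, not part of ap_verify. Each task is assumed to be
    a callable that returns a dict of measurements or raises an exception.
    """
    measurements = {}
    try:
        for task in tasks:
            # No attempt is made to resolve or suppress the exception.
            measurements.update(task())
    finally:
        # Runs on success and on failure, so measurements from tasks that
        # completed before the failure point are written to disk before
        # the exception propagates to the caller.
        Path(metrics_path).write_text(json.dumps(measurements))
    return measurements
```

The finally clause is one simple way to combine "terminate on failure" with "preserve what was already computed": the original exception still reaches the caller, but only after the partial results have been saved.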

Recovering metrics from partial runs

In Gen 2 mode, ap_verify produces some measurements even if the pipeline cannot run to completion. Specifically, if a task fails, any previously completed tasks that store measurements to disk will have done so. In addition, if a metric cannot be computed, ap_verify may attempt to store the values of the remaining metrics.
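As a rough sketch of the second behavior, the loop below keeps the values of metrics that can be computed when one of them cannot. The function compute_available_metrics and the mapping of metric names to callables are hypothetical, and are not how ap_verify or its metrics framework is actually organized.

```python
def compute_available_metrics(metric_functions, data):
    """Compute each metric independently, keeping those that succeed.

    Hypothetical helper, not part of ap_verify. metric_functions maps a
    metric name to a callable that takes the data and returns a value.
    """
    values = {}
    failures = {}
    for name, func in metric_functions.items():
        try:
            values[name] = func(data)
        except Exception as exc:
            # Record why this metric is missing, then continue with the rest.
            failures[name] = repr(exc)
    return values, failures


# Example: the mean is undefined for empty input, but the count is not.
metrics = {
    "count": len,
    "mean": lambda d: sum(d) / len(d),
}
values, failures = compute_available_metrics(metrics, [])
```

Here values contains only the count, while failures records the error raised by the mean. Isolating each metric in its own try block means one uncomputable metric does not discard the others.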

If the pipeline fails, ap_verify might not preserve measurements computed from the dataset. Once the framework for handling metrics is finalized, ap_verify may be able to offer a broader guarantee that does not depend on how or where any individual metric is implemented.

The Gen 3 framework is not yet mature enough to handle partial failures. It is expected that Gen 3 processing will eventually be able to compute all metrics from completed tasks.