d0repro - Production scripts for p17 data reprocessing

This documentation is based on release v1_4_3.

Installation

Install d0repro in the same products area in which jim_client is installed.

setup upd
upd install -G-c d0repro -h www-d0.fnal.gov

To configure, please follow the description in the INSTALL_NOTE, whose path is printed at the end of the installation. The configuration remains valid across updates from v0_7_4 onward.

Usage

Production Commands

setup d0repro

sub_production.py <dataset> <d0release> [--test]
will submit the d0reco production for the given dataset of raw files and mark the production as RUNNING.
It can be used for initial submission as well as for recovery jobs. Files already available will be automatically excluded.
Please allow 12h after a previous production grid job for the same dataset has finished before resubmitting.

sub_merge.py <dataset> <d0release> [--test] [--nostore]
will submit the merge step for the files produced from the given dataset of raw files and mark the merge procedure as RUNNING.
It is safe to use the same command for recovery if no other merge job for the same dataset is still running.
Please allow 12h after the production grid jobs have finished before submitting a merge job. Also allow 12h before reissuing the same command to recover from failures.

check_production.py <dataset> <d0release> [--test]
Will print the current status of the d0reco production for a given dataset of raw files and store the status of the corresponding production (in production.status). The status is set to RUNNING if a running or idle grid job is found. Otherwise it is set to COMPLETE if an unmerged thumbnail was produced for every file of the input raw dataset, or to PARTIAL if any unmerged thumbnails are missing in SAM and no job is running.
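
The status rule above can be sketched as follows. This is only an illustration of the decision order, not the code of check_production.py. The function name and its three inputs are hypothetical, and the lookup of raw files, unmerged thumbnails and grid jobs is left out.

```python
def production_status(raw_files, files_with_thumbnail, job_active):
    """Return RUNNING, COMPLETE or PARTIAL following the rule in the text.

    raw_files            -- names of all files in the input raw dataset
    files_with_thumbnail -- raw files for which an unmerged thumbnail exists in SAM
    job_active           -- True if a running or idle grid job was found
    All three inputs are hypothetical stand-ins for the real lookups.
    """
    # A running or idle grid job takes precedence over everything else.
    if job_active:
        return "RUNNING"
    # Without an active job, the status depends on missing thumbnails only.
    missing = set(raw_files) - set(files_with_thumbnail)
    return "PARTIAL" if missing else "COMPLETE"
```

For example, a dataset with two raw files, one thumbnail and no active job would be reported as PARTIAL under this rule.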

check_merge.py <dataset> <d0release> [--test]
Will print the current status of the d0reco production and merging for a given dataset of raw files, and store the status of the corresponding production and merge steps (in production.status and merge.status).

undeclare_ghosts.py <dataset> <d0release> [--test] [--force]
Will search for merged files declared to SAM but without a location, and for unmerged thumbnails declared to SAM without a location that do not have a merged daughter.
After a 10s waiting time, during which the script may be aborted with Ctrl-C, it tries to "sam undeclare" these files.
All grid jobs must have been completed for more than 12 hours for the command to operate. --force lifts this requirement.
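
The selection criteria and the 12-hour guard can be sketched as below. This is an illustration under assumptions, not the code of undeclare_ghosts.py. The record layout (keys name, tier, has_location, has_merged_daughter) and both function names are hypothetical. The datatier names thumbnail and unmerged-thumbnail are the ones used elsewhere in this documentation.

```python
from datetime import datetime, timedelta

# Minimum time since the last grid job completed (from the text above).
MIN_AGE = timedelta(hours=12)

def select_ghosts(files):
    """Return the names of files that match the ghost criteria.

    files -- list of dicts with the hypothetical keys
             name, tier, has_location, has_merged_daughter
    """
    ghosts = []
    for f in files:
        if f["has_location"]:
            continue  # files with a location are never ghosts
        if f["tier"] == "thumbnail":
            # merged file declared without a location
            ghosts.append(f["name"])
        elif f["tier"] == "unmerged-thumbnail" and not f["has_merged_daughter"]:
            # unmerged thumbnail without a location and without a merged daughter
            ghosts.append(f["name"])
    return ghosts

def may_undeclare(last_job_end, now, force=False):
    """True if the 12-hour requirement is met or lifted by force."""
    return force or (now - last_job_end) > MIN_AGE
```

An unmerged thumbnail that lacks a location but already has a merged daughter is deliberately left alone in this sketch, as described above.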

Common Arguments

<dataset> is the dataset of raw files under consideration for all production related commands.
<d0release> is the release version to be used (p17.01.00, p17.02.00, p17.03.01 and p17.03.03 currently allowed)
--test will switch the application name and version reported in the metadata to d0reco-test and <d0release>-test, respectively. This parameter is optional. It should be used for testing and for certification runs.
--nostore will turn off storing the merged files to enstore in merge jobs.
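
The effect of <d0release> and --test on the reported metadata can be sketched as follows. The function name is hypothetical, and it is an assumption that the application name and version without --test are d0reco and the plain <d0release>. The list of allowed releases is the one given above.

```python
# Releases listed as currently allowed in this documentation.
ALLOWED_RELEASES = ("p17.01.00", "p17.02.00", "p17.03.01", "p17.03.03")

def reported_application(d0release, test=False):
    """Return the (application name, version) pair reported in the metadata.

    Hypothetical helper: the non-test values are an assumption.
    """
    if d0release not in ALLOWED_RELEASES:
        raise ValueError("release not allowed: %s" % d0release)
    if test:
        # --test appends "-test" to both the name and the version
        return ("d0reco-test", d0release + "-test")
    return ("d0reco", d0release)
```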

Commands to administer job status

set_status.py [merge|production] [new|approved|finished|held] <dataset> <d0release> [--test]
Sets the status of a merge or production step to "new", "approved", "finished" or "held". "Finished" marks a job that has unrecoverable errors as done, i.e. no more recovery jobs will be run. "Approved" marks a job for later recovery (it will be submitted by auto_pilot). "Held" marks a job for later investigation (it will be ignored by auto_pilot).

clean_completed.py
will move projects that have a merge status of complete or finished (from complete or finished inputs) to a subdirectory.

Arguments

<dataset> is the dataset of raw files under consideration for all production related commands.
<d0release> is the release version to be used.
--test will switch the application name and version reported in the metadata to d0reco-test and <d0release>-test, respectively. This parameter is optional. It should be used for testing and for certification runs.

Commands that operate on multiple jobs

check_all.py
This command will run check_merge.py on all jobs currently available in the d0repro-work directory.

list_status.py [--all]
This will summarise the status of all jobs as computed by a previously run check_all.py or check_merge.py. For jobs which are in status PARTIAL the auto-pilot suggestion is given.
--all Jobs in status NEW are listed as well. By default these are suppressed.

auto_pilot.py [ --all | --merge-only | --production-only | --auto-approve]
This will summarise the status of all jobs as computed by a previously run check_all.py or check_merge.py. For jobs which are in status PARTIAL the auto-pilot suggestion is given (as in list_status.py). In addition, a script named "Autopilot.sh" is created in the d0repro-work directory.
--all - jobs in status NEW are listed as well. By default these are suppressed.
--merge-only - only sub_merge commands will be added to the script Autopilot.sh
--production-only - only sub_production commands will be added to the script Autopilot.sh
--auto-approve - in addition to the sub_* commands some datasets which are in status NEW will be moved to status APPROVED. The number of datasets to be approved is currently identical to the number of merge jobs suggested by the autopilot.
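
How the options shape the generated script can be sketched as below. This is not the code of auto_pilot.py. The function names and the suggestion tuples are hypothetical. It is also an assumption that datasets are approved in list order and that the merge count is taken before any --production-only filtering; the documentation only fixes how many datasets are approved.

```python
def autopilot_lines(suggestions, merge_only=False, production_only=False):
    """Build the command lines that would go into Autopilot.sh.

    suggestions -- list of (step, dataset, d0release) tuples,
                   where step is "merge" or "production" (hypothetical layout)
    """
    lines = []
    for step, dataset, release in suggestions:
        if merge_only and step != "merge":
            continue
        if production_only and step != "production":
            continue
        # yields sub_merge.py or sub_production.py with the common arguments
        lines.append("sub_%s.py %s %s" % (step, dataset, release))
    return lines

def datasets_to_approve(new_datasets, suggestions):
    """Pick as many NEW datasets as there are suggested merge jobs."""
    n_merge = sum(1 for step, _, _ in suggestions if step == "merge")
    return list(new_datasets)[:n_merge]
```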

Commands used for Certification

create_certification_dataset.py [unmerged|merged] <dataset_of_inputs> <d0release> <gid_regexp> --test
creates a dataset of unmerged (or merged) thumbnails produced from a given dataset of inputs by the named grid jobs.

Arguments

unmerged selects files with the datatier unmerged-thumbnail produced from the given dataset.
merged selects files with the datatier thumbnail.
<dataset_of_inputs> dataset of input files.
<d0release> is the release version to be used.
<gid_regexp> regular expression used to filter for files from a given production.
--test will switch the application name and version reported in the metadata to d0reco-test and p17.02.00-test, respectively. This parameter is formally optional, but it should always be used in certification runs.
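
The role of <gid_regexp> can be illustrated with a small filter. This is a sketch only: the function name and the (file name, grid job id) pairs are hypothetical, and whether the real script matches the expression anywhere in the id (as re.search does here) or only at its start is an assumption.

```python
import re

def filter_by_gid(files, gid_regexp):
    """Keep the files whose grid job id matches gid_regexp.

    files -- list of (file name, grid job id) pairs (hypothetical layout)
    """
    pattern = re.compile(gid_regexp)
    return [name for name, gid in files if pattern.search(gid)]
```

With this sketch, an expression such as "^job_12" would keep only files from grid jobs whose (made-up) ids start with job_12.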

Restrictions

Release history

1.0.1 - Added support for p17.03.03
1.0.2 - Fixed bug in check_* scripts reported by Tibor.
1.0.3 - Improved messages from check_production
1.1.0 - Rewrote the queries that determine the children of a given file (results in a speedup of an order of magnitude)
1.2 - New jobs dataset for p17.03.03 production. Fixes tarball bug.
1.3 - Added check for correct size of unmerged-thumbnails
1.4 - Added autopilot functionalities

Daniel Wicke (wicke@fnal.gov)
Last modified: Wed Sep 21 03:30:40 CDT 2005