D0 P17 Reprocessing

Mike Diesburg (diesburg@fnal.gov)
Daniel Wicke (wicke@fnal.gov)

Participating sites

FNAL Farm, CMS-Farm, GridKa, Lyon, Prague, D0SAR, UK, Westgrid.

Overall status

Reprocessing Status

TMBfix Status

Refix Status

Detailed stats and run assignments are here

Reprocessing Status

Processing Totals for Each Site
Cumulative P17 production plots, all sites
Cumulative P17 production plots, SAMGrid sites
Input Projects and Processing Statistics for Each Site

Reprocessing Recovery Status

Reprocessing recovery restored about 20M of the 30M missing events. A detailed investigation of the remaining failures is documented in a reprocessing recovery log. This investigation has not been completed because other tasks have taken higher priority.

Software to be used in production

D0Release: p17.03.03

Infrastructure: SAMGrid

The actual production should take place with the March release cut of SAMGrid (see below).

SamGrid Release Cuts

Currently it is recommended to run reprocessing with the August release cut. The indicated bug fixes to the cut are needed only for those who observe problems with the original installation.

Please check the description of the SamGrid release cuts for details.

DØ Release Tarballs

DØ Release Tarballs are required by the SamGrid infrastructure to operate on DØ releases. They're created centrally and stored in SAM. The d0tools contain a table that assigns the release tarball to be used for each supported d0release. A description of how to create new tarballs is provided by Cano Ay.

Operations: d0repro

To ease production, the d0repro package is recommended. The production release (p17.03.03) is supported from v1_0_1 onward. Please upgrade to this version or newer.

The procedure should be as follows:
1st) start a job with
$ sub_production.py <your-dataset-name> p17.03.03
2nd) test completion with
$ check_production.py <your-dataset-name> p17.03.03
In case of partial success go back to step one and resubmit. Please allow 12 hours between job completion and submitting the recovery job.
You should retry until all files are done or the remaining files show an unrecoverable error. Unrecoverable errors are those that crash twice with the same exit code in the same event (according to the monitoring page).

In case of success continue with merging:
3rd) submit the merge job with
$ sub_merge.py <your-dataset-name> p17.03.03
4th) test completion with
$ check_merge.py <your-dataset-name> p17.03.03
In case of partial success redo step 3. Please allow 12 hours between job completion and submitting the recovery job.
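
The retry rules above can be sketched in a few lines of Python. This is only an illustration, not part of d0repro: the Crash record, the function names and the example file names are hypothetical, and the crash history is assumed to be transcribed by hand from the monitoring page.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Set

# Waiting time between job completion and the recovery submission (from the text above).
RECOVERY_DELAY = timedelta(hours=12)


class Crash(NamedTuple):
    """One failed attempt (hypothetical record, copied from the monitoring page)."""
    file_name: str
    exit_code: int
    event: int


def unrecoverable_files(crashes: Iterable[Crash]) -> Set[str]:
    """Files that crashed at least twice with the same exit code in the same event."""
    counts = Counter(crashes)  # identical (file, exit code, event) triples are counted together
    return {crash.file_name for crash, n in counts.items() if n >= 2}


def ready_for_recovery(job_completed: datetime, now: datetime) -> bool:
    """True once the 12 hour waiting period after job completion has passed."""
    return now - job_completed >= RECOVERY_DELAY


# Example history with made-up file names and exit codes.
history = [
    Crash("raw_file_a", 134, 1201),
    Crash("raw_file_a", 134, 1201),  # same exit code, same event: unrecoverable
    Crash("raw_file_b", 134, 88),
    Crash("raw_file_b", 1, 88),      # different exit code: worth another retry
]
print(unrecoverable_files(history))
```

Files not returned by unrecoverable_files are still candidates for another pass through steps one and two.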

Cleanup of durable location

To clean up unmerged thumbnails that are no longer needed from the durable location, samgrid_utils v2_0_1 or later should be installed. Attention: The script doesn't work with the "current" version of sam_user_api, which is v5_1_0_5. It has been tested with sam_user_api v5_1_0_9. The version distributed to our batch jobs via sam_client is v5_1_0_14. In the local environment this version needn't be marked current.

The following script is meant to be run as user sam in a cronjob. For now we should issue it manually until we gain some confidence.

source your/jim-ups/setup.csh
setup sam_user_api v5_1_0_9
setup jim_sandbox
setup jim_merge
remove_temp_files.py --duplicate-file-check=True --erased-file-location-check=True

Tasks Performed Within the p17 Reprocessing

Based on the experiences of the p14 reprocessing we want to improve the existing procedure. In general an improved procedure should be extensively tested before it becomes the baseline in our planning.

The anticipated start date of January 2005 allowed us to gridify the whole project. SamGrid (i.e. JIM and SAM) was chosen as the software platform. Delays in finalising the d0release allowed us to improve the monitoring and the infrastructure considerably.

We expect to start around 20-March-2005.

Site certification (Joe Steele)

Each site must prove that it is able to run d0reco and produce results identical to those found at FNAL. The certification procedure requires each site to process two datasets of ~100 files each. See title link for more information and results.

Remote TMB merging

As the merging of TMBs can no longer be done on d0mino, the merging procedure has to be adapted. Merging has to work at remote sites.
Status: Merging scripts were rewritten by Andrew Baranowski and rely on SAM for bookkeeping.
Who is doing runjob 6? Iain will handle bug fixes.

Database Proxies (Thomas Nunnemann)

To reprocess from raw data, database proxies are needed to achieve fast enough access to the calibration databases.
Tasks: Prototype installation (done), installation help for other sites, field test with many sites.
EDG/LCG people need to investigate how to use this.

Dataset Preparation and Bookkeeping

Dataset creation has been Mike-power intensive and bookkeeping has been done individually at each site. A common (SAM-based?) bookkeeping would remove duplicated effort.
SamGrid fully relies on SAM as its bookkeeping database. Initial production scripts are provided by DW.
To aid bug tracing, the JIM XML DBs are available at each site as a quick diagnostic tool.

Useful checks (DW)

To further automate error detection, we need to define checks to run on each produced file to verify its completeness and integrity. It would also be good to verify that all checks were actually performed.
- Check completion (has each input event been processed and been written?)
- Check data integrity (is the output file a valid DST/TMB?)
- Compute and pass on CRC if these tests have passed?
Suggestions of which further checks are useful are very welcome.

The result of each test should be stored together with the check's version number as "not checked", "passed" or "failed".
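
A minimal sketch of such a result record, assuming an in-memory table rather than SAM metadata. All names (Status, CheckResult, run_checks) and the two toy checks are hypothetical; real completion and integrity checks would have to inspect the DST/TMB content. The CRC is computed with zlib.crc32 from the Python standard library and only passed on if every check has passed.

```python
import zlib
from enum import Enum
from typing import Callable, Dict, NamedTuple, Optional, Tuple


class Status(Enum):
    NOT_CHECKED = "not checked"
    PASSED = "passed"
    FAILED = "failed"


class CheckResult(NamedTuple):
    version: str   # version number of the check that produced this result
    status: Status


Checks = Dict[str, Tuple[str, Callable[[bytes], bool]]]


def run_checks(data: bytes, checks: Checks) -> Tuple[Dict[str, CheckResult], Optional[int]]:
    """Run every check on the file content; return the results and a CRC if all passed."""
    # Start from "not checked" so that a check which could not run stays visible.
    results = {name: CheckResult(version, Status.NOT_CHECKED)
               for name, (version, _) in checks.items()}
    for name, (version, check) in checks.items():
        try:
            ok = check(data)
        except Exception:
            continue  # the check itself broke: leave it as "not checked"
        results[name] = CheckResult(version, Status.PASSED if ok else Status.FAILED)
    all_passed = all(r.status is Status.PASSED for r in results.values())
    crc = zlib.crc32(data) if all_passed else None
    return results, crc


# Toy checks standing in for the real completion and integrity tests.
checks: Checks = {
    "completion": ("v1", lambda data: len(data) > 0),
    "integrity": ("v1", lambda data: data.startswith(b"TOY")),
}
results, crc = run_checks(b"TOY payload", checks)
for name, result in results.items():
    print(name, result.version, result.status.value)
```

Starting every entry as "not checked" makes it possible to verify afterwards that all checks were actually performed, as asked for above.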

Such metadata 'parameters' will be available in SAM v6 only, which is slightly too late for the reprocessing. We have to live without them.

Estimated contribution (Daniel Wicke, Mike Diesburg)

Estimated contributions by each site are linked to the site names at the top of this page.

Presentations of our Status and Plans

All D0 Meeting, 22-Oct-2004.
Wuppertal Group Meeting, 11-Nov-2004.
Daniel Wicke, 13-Apr-2004. Last update 11-July-2006.
wicke@fnal.gov
