Mike Diesburg (diesburg@fnal.gov)
Daniel Wicke (wicke@fnal.gov)
Participating sites
FNAL Farm,
CMS-Farm,
GridKa,
Lyon,
Prague,
D0SAR,
UK,
Westgrid.
Overall status
Reprocessing Status
TMBfix Status
Refix Status
Reprocessing Status
Processing Totals for Each Site
Cumulative P17 production plots, all sites
Cumulative P17 production plots, SAMGrid sites
Input Projects and Processing Statistics for Each Site
Reprocessing Recovery Status
Reprocessing recovery brought back about 20M of the 30M missing events. A detailed investigation of the remaining failures is documented
in a reprocessing recovery log. This investigation has not been completed, as other tasks have taken higher priority.
Software to be used in production
D0Release: p17.03.03
Infrastructure: SAMGrid
The actual production should take place with the March release cut of SAMGrid (see below).
SamGrid Release Cuts
Currently it is recommended to run reprocessing with the August release cut.
The indicated bug fixes to the cut are needed only for those who observe problems with the original installation.
Please check the description of the SamGrid release cuts for details.
DØ Release Tarballs
DØ Release Tarballs are required by the SamGrid infrastructure to operate on DØ releases. They are created centrally and stored in SAM. The d0tools contain a table that assigns the release tarball to be used for each supported d0release. A description of how to create new tarballs is provided by Cano Ay.
To ease production, the d0repro package is recommended.
The production release (p17.03.03) is supported from v1_0_1 onward. Please upgrade to this version or newer.
The procedure should be as follows:
1st) start a job with
$ sub_production.py <your-dataset-name> p17.03.03
2nd) test completion with
$ check_production.py <your-dataset-name> p17.03.03
In case of partial success go back to step 1 and resubmit. Please
allow 12 hours between job completion and submitting the recovery job.
Retry until all files are done or the remaining
files show an unrecoverable error.
Unrecoverable errors are those which crash twice with the same exit
code in the same event (according to the monitoring page).
In case of success continue with merging:
3rd) submit the merge job with
$ sub_merge.py <your-dataset-name> p17.03.03
4th) test completion with
$ check_merge.py <your-dataset-name> p17.03.03
In case of partial success redo step 3. Please allow 12 hours between job completion and submitting the recovery job.
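The rule for unrecoverable errors in the procedure above can be sketched in a few lines of Python. This is only an illustration: the record format (file name, exit code, event number) and the function name are assumptions, the real information has to be read off the monitoring page, and "two times" is interpreted here as "at least twice".

```python
from collections import Counter


def unrecoverable_files(crashes):
    """Return the files that crashed at least twice with the same
    exit code in the same event.

    `crashes` is an iterable of (file_name, exit_code, event_number)
    tuples. This record format is hypothetical.
    """
    counts = Counter(crashes)
    return {name for (name, _code, _event), n in counts.items() if n >= 2}


# Made-up crash records, for illustration only.
crashes = [
    ("raw_0001", 134, 5512),
    ("raw_0001", 134, 5512),  # same exit code, same event: give up
    ("raw_0002", 134, 17),
    ("raw_0002", 139, 17),    # exit code differs: worth another retry
    ("raw_0003", 1, 42),      # crashed only once: worth another retry
]
print(sorted(unrecoverable_files(crashes)))
```

Files not returned by this function would go into the next recovery submission, after the 12 hour waiting period.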
Cleanup of durable location
To remove unmerged thumbnails that are no longer needed from the durable location,
samgrid_utils v2_0_1 or later should be installed. Attention:
The script doesn't work with the "current" version of sam_user_api, which is v5_1_0_5.
It has been tested with sam_user_api v5_1_0_9. The version distributed to our batch jobs via
sam_client is v5_1_0_14. In the local environment this version needn't be marked current.
The following script is meant to be run as user sam in a cronjob. For
now we should issue it manually until we gain some confidence.
source your/jim-ups/setup.csh
setup sam_user_api v5_1_0_9
setup jim_sandbox
setup jim_merge
remove_temp_files.py --duplicate-file-check=True --erased-file-location-check=True
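Before running the cleanup it may be worth checking the "samgrid_utils v2_0_1 or later" requirement mechanically. The sketch below compares version tags of the form used on this page numerically rather than as strings (as strings, v5_1_0_14 would sort before v5_1_0_9). The helper names are made up for illustration, and the code assumes purely numeric tags; how the installed version is obtained is left open.

```python
def parse_version(tag):
    """Turn a tag such as 'v5_1_0_9' into a tuple of ints, here (5, 1, 0, 9).

    Assumes the tag is 'v' followed by underscore-separated integers.
    """
    return tuple(int(part) for part in tag.lstrip("v").split("_"))


def at_least(tag, minimum):
    """True if `tag` is the same as or newer than `minimum`."""
    return parse_version(tag) >= parse_version(minimum)


print(at_least("v2_0_1", "v2_0_1"))       # meets the samgrid_utils requirement
print(at_least("v1_9_9", "v2_0_1"))       # too old
print(at_least("v5_1_0_14", "v5_1_0_9"))  # 14 > 9 when compared as numbers
```

Note that this only orders the tags. Whether a given sam_user_api version actually works with the script still has to be tested.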
Tasks Performed Within the p17 Reprocessing
Based on the experiences of the p14 reprocessing we want to improve the existing procedure.
In general an improved procedure should be extensively tested before it becomes the baseline in our planning.
The anticipated start date of January 2005 allowed us to gridify the whole project.
SamGrid (aka JIM and SAM) was chosen as the software platform.
Delays caused by finalising the d0release allowed us to improve the monitoring and the infrastructure a lot.
We expect to start around 20-March-2005.
Each site must prove that it is able to run d0reco and produce results identical to those found at FNAL.
The certification procedure requires each site to process two datasets of ~100 files each.
See title link for more information and results.
Remote TMB merging
As the merging of TMBs can no longer be done on d0mino, the merging procedure has to be adapted.
The merging has to work at remote sites.
Status: Merging scripts were rewritten by Andrew Baranowski and rely on SAM for bookkeeping.
Who is doing runjob 6? Iain will handle bug fixes.
Database Proxies (Thomas Nunnemann)
To reprocess from raw data, database proxies are needed to achieve fast enough access to the calibration databases.
Tasks: Prototype installation (done), Installation help for other sites, Field test with many sites,
EDG/LCG people need to investigate how to use this.
Dataset Preparation and Bookkeeping
Dataset creation has been Mike-power intensive and
bookkeeping has been done individually at each site. A common (SAM-based?) bookkeeping would remove duplicated effort.
SamGrid fully relies on SAM as its bookkeeping database. Initial production scripts are provided by DW.
To aid bug tracing, the JIM XML DBs are available at each site as a quick diagnostics tool.
Useful checks (DW)
To further automate the error detection we need to define checks that should be run on each
produced file to verify its completeness and integrity.
It would be good to also verify that all checks were actually performed.
- Check completion (has each input event been processed and been written?)
- Check data integrity (is the output file a valid DST/TMB?)
- Compute and pass on CRC if these tests have passed?
Suggestions of which further checks are useful are very welcome.
The result of each test should be stored, together with the check's version number, as "not checked", "passed" or "failed".
Such metadata 'parameters' will be available in SAM v6 only, which
is slightly too late for the reprocessing. We have to live without them.
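As long as SAM cannot hold these parameters, the intended bookkeeping can at least be sketched outside of SAM. The following is a minimal illustration, not an existing tool: the class names, the check names "completion" and "integrity", and the version strings are all assumptions. It records one of the three states per check and answers the two questions raised above: were all checks performed, and did they all pass (the precondition for computing a CRC)?

```python
from dataclasses import dataclass
from enum import Enum


class CheckStatus(Enum):
    NOT_CHECKED = "not checked"
    PASSED = "passed"
    FAILED = "failed"


@dataclass(frozen=True)
class CheckResult:
    check_name: str      # hypothetical name, e.g. "completion"
    check_version: str   # version of the check that produced the result
    status: CheckStatus = CheckStatus.NOT_CHECKED


def all_checks_performed(results, required):
    """True if every required check has a result other than 'not checked'."""
    done = {r.check_name for r in results
            if r.status is not CheckStatus.NOT_CHECKED}
    return set(required) <= done


def all_checks_passed(results, required):
    """True if every required check passed; only then compute the CRC."""
    passed = {r.check_name for r in results
              if r.status is CheckStatus.PASSED}
    return set(required) <= passed


REQUIRED = ("completion", "integrity")
results = [
    CheckResult("completion", "v1", CheckStatus.PASSED),
    CheckResult("integrity", "v1"),  # defaults to "not checked"
]
print(all_checks_performed(results, REQUIRED))
print(all_checks_passed(results, REQUIRED))
```

Keeping "not checked" as an explicit default state makes a forgotten check distinguishable from a passed one.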
Estimated contribution (Daniel Wicke, Mike Diesburg)
Estimated contributions by each site are linked to the site names at the top of this page.
Presentations of our Status and Plans
All D0 Meeting, 22-Oct-2004.
Wuppertal Group Meeting, 11-Nov-2004.
.....
Daniel Wicke, 13-Apr-2004. Last update 11-July-2006.
wicke@fnal.gov