Reprocessing: Meeting of 29-August-2005: 9:30-10:30CDT ESNet video conference
Our meeting number on the ESNet is 823073776 (82d0repro).
Instructions to dial into a video conference via phone.
Agenda
- News
- New JIM release cut (September release cut; coexistence with MC, bug fixes, new KCA fingerprint)
- Status of remote site Certification
- JIM Deployment and Remote Setup of remaining sites.
- Status of production
- Web interface for reprocessing in Prague (Vlastislav Hynek)
- AOB.
Minutes
Participants
Tibor Kurca (Lyon),
Joel Snow (Oklahoma),
Vlastislav Hynek,
Andrew Baranovski,
Parag Mhashilkar,
Gabriele Garzoglio,
Eduardo Gregores,
Mike Diesburg (FNAL),
Daniel Wicke (Wuppertal)
Topics
- News
- Mike is going to rerun the accounting to cross-check missing datasets.
At the moment it seems that there are only very small datasets left prestaged to sites.
All sites are expected to finish this by October.
80M events are still unassigned.
- Assignment of resources to MC production. Gavin will circulate a draft assignment
balancing MC production with reprocessing. Sites are expected to comment.
MC is not just a filler for idle cycles. It is needed for physics.
- Status of production
- D0Farm: CPU limit on SamGrid production has been raised to 400.
Immediately ran into problems with size of the durable location.
Job throttling is not fully functioning for the site. A fully ordered execution of jobs would reduce the average lifetime of files in the durable location. Under investigation by the SamGrid team.
Changed cleanup script to wait 3h instead of 24h.
Is that wait time still needed if we use --duplicate-file-check=True?
GG: a check on the location is still needed for MC.
Does the cleanup script check for locations of the merged files? GG: probably not.
- WestGrid: --
- Lyon: Started to process non-prestaged (reassigned) data. Now limited by transfer rates.
Lyon observes a 4-5% failure rate in sam-store for production jobs. Files aren't declared to SAM.
GG suggests adding a retry mechanism to mc_runjob (this makes sense again thanks to the improved SamGrid interface).
In some cases unmerged thumbnails have disagreeing numbers of events.
At production level everything seems fine.
MD suggests running the input file manually through dsdump. Do the affected files have disk locations in common?
Can we get an "analyzed_status good" dimension that would allow recovery from such problems? DW will check with Robert
and Adam.
- DØSAR-Oscer, CMS-Farm, Wisconsin: Load problems. Joel will configure SamGrid for throttling.
Joel's submission site is down and needs to be fixed (should be done by the end of the week).
- DØSAR-UTA: --
- DØSAR-Sprace: Crashes under high load while a large number of merging jobs were running.
It is suggested to compare the kernel version with that of other sites. The information should be in the mail archives.
- Prague: No serious problems.
- Imperial College: Gavin and a student are keeping production up during Frederic's absence.
- Manchester: --
- GridKa: --
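The retry mechanism suggested for the Lyon sam-store failures could look roughly like the sketch below. This is not the mc_runjob implementation: the function name store_with_retry, the store callable standing in for the sam-store step, and the attempt count and delays are all assumptions made for illustration.

```python
import time


def store_with_retry(store, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call store() until it succeeds or max_attempts is exhausted.

    store is a hypothetical zero-argument callable wrapping the sam-store
    step; it is assumed to raise an exception when the store fails.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            store()
            return attempt
        except Exception:
            if attempt == max_attempts:
                # Give up and let the caller see the last failure.
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

If individual failures were independent (an assumption, not something established in the meeting), three attempts at a 5% failure rate would leave roughly 0.05^3, about 0.01%, of files undeclared.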
- Status of remote site Certification
- Status of production certification.
- Status of Merge Certification of Sites
- GridKa: done, but expected to run the remaining two datasets.
- JIM Deployment and Remote Setup of remaining sites.
- Web interface for reprocessing in Prague (Vlastislav Hynek)
- New JIM release cut (September release cut; coexistence with MC, bug fixes, new KCA fingerprint)
The MC web page lists a newer release. Will check with GridKa whether any reprocessing-related issues are observed.
The September release shall include the new KCA fingerprint.
It was agreed that we should coordinate future upgrades with the MC effort.
The reprocessing web page will continue to mark new releases as current only
after some tests at a restricted number of sites (usually only one site).
- AOB:
- Log files from submission sites: Samgrid.fnal.gov is out of disk space. What do we want to do with these files?
Amber will provide disks for a limited time (3 months).
A post-mortem discussion on deleting those files shall be held during the collaboration meeting in December.
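For the D0Farm cleanup-script questions under Status of production (wait time, and whether merged-file locations are checked), the selection logic can be sketched as below. This is only an illustration, not the actual cleanup script: the function cleanup_candidates, its (name, mtime) input format, and the merged_has_location callback are hypothetical.

```python
def cleanup_candidates(files, now, wait_hours=3.0,
                       merged_has_location=lambda name: True):
    """Return names of durable-location files that may be deleted.

    files: iterable of (name, mtime_seconds) pairs (assumed input format).
    now: current time in seconds, on the same clock as the mtimes.
    wait_hours: minimum file age before deletion (3h per the minutes).
    merged_has_location: hypothetical check that the merged file built
        from this input already has a location; by default it always
        passes, matching the "probably not checked" answer in the minutes.
    """
    cutoff = now - wait_hours * 3600
    return [name for name, mtime in files
            if mtime <= cutoff and merged_has_location(name)]
```

A file is kept if it is younger than the wait time or if the (assumed) location check on its merged file fails.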
Next Meeting
12-Sep-2005
Mike Diesburg, Daniel Wicke, 22-Aug-2005. Last Change 29-Aug-2005.
Diesburg@fnal.gov,
Wicke@fnal.gov