Reprocessing: Meeting of 23-May-2005: 9:30-10:30 ESNet video conference
Our meeting number on the ESNet is 823073776 (82d0repro).
Instructions to dial into a video conference via phone.
Agenda
- News
- Status of remote site Certification
- JIM Deployment and Remote Setup of remaining sites.
- Status of production
- AOB.
Minutes
Participants
Yann Coadou, Sabah Salih, Joel Snow (SAR/OSCER), Joe Steele, Gabrielle Garzoglio, Michael Diesburg, Peter Love
Topics
- News
- Status of production
- WestGrid: Half of the cluster was used by Atlas last week, so not a lot of production was done. Restarted
on Friday but had some initial troubles after the reboot of d0ora2 earlier in the week. This was eventually
cleared by restarting all the DB servers. The Head node has been fairly stable. It has gone ~9 days without
a crash. It is believed that the crashes are associated with some activity that Atlas is doing.
The new durable location on the Storage Facility is in place. Seems to be working OK so far. WestGrid
now basically has unlimited storage space. Should make it easy to run production during the week and merge on
weekends. Don't yet know if this will help with the congestion problems of running production and merging
simultaneously. So far the most common cause of job failures has been communication failure with the
FNAL Run Config DB server. May need to set up a proxy for this as well, as Thomas has suggested in the
past.
- Lyon: No one present
- SAR: Joel reports that OSCER started its first dataset at Oklahoma. There was a ~6% failure rate on the first
set. No investigation done yet into the causes of the failures. They have started the second dataset. Joel is
lobbying to get more nodes allocated. Datasets are needed for Wisconsin running.
Joel says he sees no interference on the Head node from Atlas operations like WestGrid has reported.
- Prague: No one present
- GridKa: No one present (Gabrielle reports they are still having NFS problems and difficulties making scratch areas).
- Manchester: Certification has been submitted. Joe says to proceed. Have ~14 TB of space available for prestaging. Need
datasets.
- Lancaster: Peter is bringing new cluster up in his spare time. Grid certificate has been requested.
- Status of remote site Certification
- Status of production certification.
- Prague:
- SAR: UTA:
- SAR: Oscer:
- UK: IC:
- CMS Farm:
- Status of Merge Certification of Sites
- DØFarm: done.
- WestGrid: done.
- Lyon: done.
- SAR: UTA: done
- SAR: Oscer: done
- GridKa: done.
- Wisconsin: done.
- Prague: done.
- UK: IC:
- UK: Manchester:
- JIM Deployment and Remote Setup of remaining sites.
- Gabrielle has been investigating IN2P3 problems. Found last week that several critical files in the installation were
zero length. Checked a working installation and found that some were normally zero length but others were clearly
not as they should be. IN2P3 people should continue submitting jobs and JIM people will monitor whether the problem recurs.
A more robust batch handler was installed for IN2P3. Work on the CMS farm is ongoing. CONDOR was found to be configured
to force pre-emption, which caused lots of job problems. This has been corrected and they have gone back to certification.
- RAL: Required user accounts now available. Installation about to start.
- Lancs: --
- AOB:
Action Items:
Next Meeting
23-May-2005.
Mike Diesburg, Daniel Wicke, 16-May-2005. Last Change 16-May-2005.
Diesburg@fnal.gov,
Wicke@fnal.gov