Reprocessing: Meeting of 14-Mar-2005: 9:30-10:30 ESNet video conference
Our meeting number on the ESNet is 823073776 (82d0repro).
Instructions to dial into a video conference via phone.
Agenda
- News
- Status of central production certification.
- Status of Merge Certification of Sites
- JIM Deployment and Remote Setup
- AOB.
Minutes
Participants
Thomas Nunnemann (Munich), Tibor Kurca, Patrice Patrice (Lyon),
Frederic Villeneuve-Seguier (London), Joe Steele, Joerg Meyer, Mike Diesburg, Daniel Wicke (FNAL)
Topics
- News
- Status of p17.03.02: no build started. As of Friday the code wasn't available to Qizhong. Besides the announced DB client update, we expect a reduced number of floating point exceptions and thus less manual intervention.
- Slight modification of SAMGrid cut: Now jim_job_managers v2_2_41_1.
- In addition, samgrid_util will be updated to remove files that are no longer used from the durable location.
- An upgrade of d0repro will appear this week; it supports p17.03.02 and adds many features similar to a request system.
- Status of production certification.
We have plots for Lyon and WestGrid.
Both show deviations at the same small level as the comparison of STD vs. JIM on the d0farm.
- Status of Merge Certification of Sites
- DØFarm: done.
- WestGrid: datasets 1 and 2 are done with p17.02.00. Plots will appear after the meeting.
- Lyon: done
- SAR: UTA: done
- GridKa: Not merged yet.
- Wisconsin: Screwed dataset 0.
- JIM Deployment and Remote Setup
- GridKa: Still problems with SAM_NAMING_SERVICE. New version of mc_runjob needs to be tested.
- Prague: GG by email: We have finished the investigation of their problems running
reprocessing with SAM-Grid. It turned out that the d0 code in the sam
cache was corrupted. Replacing the tar ball with a new copy solved the
problem. The action item is to introduce a CRC in the metadata of the
binary that we upload to the system: this should make detection of this
sort of problem easier.
- UK:
- IC: Final tests are ongoing. Merge certification jobs are expected to start this week.
GG by email: the configuration of the system has been cleared and we can now
run single production jobs. We are working on the scalability for 100
jobs. The configuration requires special care at this site, because the
batch system does not provide stage-in capabilities AND there is no
shared file system AND we can only use scp to pull the bootstrapping
file (which tends to fail for 100 concurrent requests). We are testing
retry techniques to increase the pulling efficiency. Once the site
works for 100 concurrent jobs, we'll need to devise a plan with the
system administrators to scale up.
- Manchester: GG by email: the site is functional for production jobs. The station
had a misconfiguration on the routing of the files, which caused our
test merging jobs to fail. We have corrected it and the site is being
tested for merging again.
- RAL: Frederic started to work on installing RAL.
- CMS Farm: --
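The CRC action item noted under Prague could be sketched roughly as below. This is only an illustration: the minutes do not say which checksum is meant or how it would be stored in the metadata, so the use of CRC-32 via Python's zlib module and the function names file_crc32 and tarball_is_intact are assumptions.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file as an unsigned integer, reading in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large tar balls need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def tarball_is_intact(path, expected_crc):
    """Compare a cached copy against the CRC recorded at upload time.

    expected_crc is assumed to come from the metadata of the uploaded binary;
    how it is stored and retrieved is outside this sketch.
    """
    return file_crc32(path) == expected_crc
```

A station could run such a check on the copy in the cache before unpacking it and fetch a fresh copy when the values differ.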
- AOB:
- Yann and Tibor report that files are stored after the batch job has finished.
This makes the XML DB inconsistent with the actual situation. Is there a way to
fix the XML DB after the fact? Maybe Gabriele can provide such a script.
- The distributed datasets contain files larger than 2GB. These aren't marked bad.
These files aren't bad by definition, though they can't be processed. Mike will talk to people about how to get rid of them.
- Is it possible to create a special cp-class for use inside the GridKa site?
Requested in order to remove load from the head node during raw file download.
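For the files larger than 2GB mentioned under AOB, a site could list the affected files with a check along these lines. This is a minimal sketch, not an existing tool: the function name oversized_files is hypothetical, and reading "2GB" as 2 * 1024**3 bytes is an assumption.

```python
import os

# Assumed reading of "2GB" in the minutes: 2 GiB expressed in bytes.
SIZE_LIMIT_BYTES = 2 * 1024**3


def oversized_files(paths, limit=SIZE_LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files strictly larger than limit."""
    flagged = []
    for path in paths:
        size = os.path.getsize(path)
        if size > limit:
            flagged.append((path, size))
    return flagged
```

The resulting list only identifies candidates; what to do with them is left to the follow-up Mike is taking on.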
Action Items:
Next Meeting
21-Mar-2005.
Mike Diesburg, Daniel Wicke, 11-Mar-2005. Last Change 11-Mar-2005.
Diesburg@fnal.gov,
Wicke@fnal.gov