P14 Reprocessing: Meeting of 12-Nov-2003: 10:00-11:00 VRVS Desert.
Agenda
- News from CPB.
- Site Certification (Intro by Phil?).
- Transport speed (Intro by Rod).
- Status of Sites.
Minutes
Participants
Rod Walker (Imperial), Willem van Leeuwen (NIKHEF), Roberto Chierici (Bern), Dugan O'Neil (Vancouver), Michel Jaffre (Orsay), Thomas Nunnemann (Munich), Jeff Templon, Mike Diesburg (FNAL), Phil Lewis (FNAL), Patrice Lebrun (Lyon)...
Topics
- News from CPB:
- Restart reprocessing using p14.05.02, which fixes a bug found in certification. Remaining problems in p14.05.02 will be repaired by TMBfix.
- At the same time switch to non-selective reprocessing.
- Once the reprocessing is running, some sites might switch to MC (reprocessing or production), depending on the efficiency we can reach.
- Site Certification:
- Phil: FNAL plots are missing.
- Five partitions were made at Cluedo (non-selective) and compared to the non-selective output from Lyon. The remaining differences can't come from processor differences. They may be due to possible DB access on Cluedo (magnetic field).
- Remote sites differ only by processors. All differences are considered small.
- Transport speed:
- Rod: The central router was getting a very low percentage of transfer requests. Reconfigured in SAM: new station rp-router with higher priority to enstore.
- All sites should switch to rp-router. In case of problems we can set up another station.
- The 65GB cache of the new station shouldn't be a problem. It only needs to hold as many files as there are simultaneous transports.
- Mike will assign new datasets, which should be used to stress the new station. All sites should start the transport on Thursday at 10am CST.
- Storing DSTs: should be done by the sites. We'll provide an update of the copy scripts which check the number of events in the TMBs and DSTs before transporting/storing them.
- Site Reports:
- GridKa: Th. Nunnemann: All assigned datasets are processed. In selective reprocessing a file might have no selected event, which makes it look as if it had not been processed. (No problem in future.)
- Lyon: Stopped by data transfer. 8.4M transported, 8M processed, using 10 CPUs only. LHC experiments are expected to start the challenge soon.
- NIKHEF: Running on the NIKHEF local farm (not EDG) and storing files to d0mino, but permissions are such that the files are only readable by SAM.
- NPACI: -
- SAR: Expect another 1 or 2 weeks before ready to go.
- UK: Currently running RAL only, Imperial and Manchester are ready to go, but due to the transport speed not used at the moment.
- WestGrid: Both sites up. Transport is working (after fixing IP-routing tables). Tracing problems arising from worker nodes being on a private network.
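The cache argument made under Transport speed can be sketched as simple arithmetic: the cache limits how many files can be in flight at once, not how much data is moved in total. The function name and the 1 GB file size below are assumptions for illustration only. Only the 65GB cache size comes from the discussion.

```python
def max_simultaneous_transports(cache_gb: float, file_gb: float) -> int:
    """Number of whole files of size file_gb that fit in a cache of cache_gb."""
    if cache_gb < 0 or file_gb <= 0:
        raise ValueError("cache size must be >= 0 and file size must be > 0")
    return int(cache_gb // file_gb)


# 65 GB is the cache size quoted for rp-router; the 1 GB file size is an
# assumed, illustrative value and not a number from the meeting.
print(max_simultaneous_transports(65, 1.0))
```

As long as the number of simultaneous transports across all sites stays below this bound, the cache size should not limit the transfer.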
Action Items:
- Prepare an update of the copy script which checks the number of events in the files before transporting/storing them (Mike and Daniel).
- Start transport of newly assigned datasets using the new routing station "rp-router" on 13.11.2003 at 10am Chicago time (All sites).
- Switch to p14.05.02 as soon as it appears and redo the certification dataset (All sites).
- Create Tarball for p14.05.02 asap (Ian)
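A minimal sketch of the event-count check requested in the first action item, not the actual copy script: it assumes the TMB and DST produced from the same input should contain the same number of events, and that the counts have already been read by some other tool. The names OutputPair and pairs_to_hold_back are hypothetical.

```python
from typing import List, NamedTuple


class OutputPair(NamedTuple):
    """Event counts for the TMB and DST produced from one input file."""
    name: str
    tmb_events: int
    dst_events: int


def pairs_to_hold_back(pairs: List[OutputPair]) -> List[str]:
    """Return the names of outputs whose TMB and DST event counts disagree.

    Assumption: equal counts mean the pair is safe to transport/store.
    """
    return [p.name for p in pairs if p.tmb_events != p.dst_events]


# Made-up example values, for illustration only.
example = [
    OutputPair("file_a", 100, 100),
    OutputPair("file_b", 100, 98),
]
print(pairs_to_hold_back(example))
```

Files reported by the check would be held back at the site and inspected instead of being transported or stored.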
Next Meeting
19-Nov-2003 10:00CST
Mike Diesburg, Daniel Wicke, 5. Nov. 2003. Last Change 14. Nov. 2003.
Diesburg@fnal.gov,
Daniel.Wicke@physik.uni-wuppertal.de