Here is the reconstruction summary for the past week.

Mike Diesburg was on shift.

A full report of our problems for this week is at the end of the message.

Friday the 7th had some downtime due to recovery from problems the previous week.

Over the weekend, both tape drives in our file family width jammed.  This caused all encp stores to fail
in a very non-transparent way.  We spent until about Wednesday recovering from the resulting overflow of all of the farm buffers.
 

Monday there was work on the enstore system, and the mounts of the encp disks did not come back properly.  It took until Tuesday to get this fixed.

We were back in 'normal' production on Wednesday.

fnd026 lost a system disk and was removed from the system.

Helpdesk reports 20955, 20933, and 20884 are available at: http://csdserver1.fnal.gov/HelpDesk/cd/

Summary of production:  we reconstructed 360,873 events (we probably did more and lost some due
to copy problems when the disk filled).  Root numbers are lower because we cannot store a root
file unless ALL associated reco files were stored successfully.
 
 

Raw data produced between 09/07/2001 and 09/13/2001
File Count:  500
Average File Size:  277192
Total File Size:  138596123
Total Event Count:  767680

Raw data between 09/07/2001 and 09/13/2001 which had reco version t01.56.00 run on it
File Count:  96
Average File Size:  269642
Total File Size:  25885670
Total Event Count:  160710

Raw data Reco output files for version t01.56.00 produced between 09/07/2001 and 09/13/2001
File Count:  819
Average File Size:  408133
Total File Size:  334261308
Total Event Count:  360873

Raw Data Root files for version t01.56.00 produced between 09/07/2001 and 09/13/2001
File Count:  146
Average File Size:  15805
Total File Size:  2307664
Total Event Count:  285456
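
As a quick sanity check on the figures above, here is a small Python sketch. The labels, the tuple layout, and the function names are mine for illustration and are not part of whatever script produced the report. It confirms that each reported average is the total size divided by the file count, rounded down:

```python
# Figures copied from the summary above: (file count, average size, total size).
# The labels and tuple layout are illustrative, not the report script's own.
SUMMARY = {
    "raw": (500, 277192, 138596123),
    "raw with reco t01.56.00": (96, 269642, 25885670),
    "reco output": (819, 408133, 334261308),
    "root": (146, 15805, 2307664),
}


def average_matches(count, average, total):
    """True if the reported average equals total/count rounded down."""
    return total // count == average


def check_summary(summary=SUMMARY):
    """Return {label: bool} telling which rows are self-consistent."""
    return {label: average_matches(*row) for label, row in summary.items()}


if __name__ == "__main__":
    for label, ok in check_summary().items():
        print(label, "consistent" if ok else "INCONSISTENT")
```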

 

Full report on problems over the weekend.
 

Over the weekend the worker nodes ran great, but we ran
into problems with file stores:

1) The total number of working MII tape drives may have dropped to 1 at some
point.  George S. came in and got it back to 3 or 4 at about 5PM on Sunday, after
we made a report to helpdesk.

1a) Mike found out that the underlying problem is that the file family
width for the farms is 2 instead of 5 as it used to be.  Both drives
got stuck, the queue then grew until it was too long, and that
led to 'socket error' messages from enstore.

We need to insist that this goes back up to something rational like 4.
Due to output file expansion, we may in fact be writing
at 40 MB/sec out of d0bbin just to keep up.

We also need better diagnostics: 2 stuck drives leading to queue failure
leading to socket errors on timeout is not very informative.

2) The priority for the farms in encp was ~ 0, so we didn't have
a chance at those drives.  Carmenita proposed a fix and Mike installed
it.  This helped once the underlying problem with no usable drives was fixed.

3) Farms can build up so many store processes that they run out of swap
space.  I think this has the effect of 'killing' stores so that we don't
even ask encp for the small number of drives.
farm-admin is looking into adding a new swap disk and upping the
memory on d0bbin to 2 GB.

4) Storing multiple files from one input file can cause a real mess.  We
basically do not want anything to happen unless all files can be
transferred correctly to d0bbin for storage.  This means that we need to
ensure ~ 10GB of space on the d0bbin disk before initiating the transfer,
not just enough space for an individual file.

This is another fallout from the multiple output file fiasco.  I'm working on a
fix for this.  The fcpmany script will block on > 90% full for now.
We may wish to have an integrated store script
which deals with all files associated with a given input file.
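
The all-or-nothing space check in item 4 could look roughly like the Python sketch below. This is not the fcpmany code. The function names are hypothetical, and it assumes the destination disk is visible as a local path, which may not be how d0bbin is actually reached. It also reads the 90% rule as "the disk must not exceed 90% full after the whole group lands", which is my interpretation.

```python
import os
import shutil


def total_size(paths):
    """Sum of the sizes, in bytes, of all files in the group."""
    return sum(os.path.getsize(p) for p in paths)


def can_transfer_group(paths, dest_dir, max_fill=0.90, usage=None):
    """Return True only if the whole group of files fits on dest_dir.

    paths    -- every output file associated with one input file
    max_fill -- highest allowed fill fraction after the copy (assumed 90%)
    usage    -- optional (total, used, free) tuple in bytes, handy for tests;
                by default it is read with shutil.disk_usage(dest_dir)
    """
    if usage is None:
        usage = shutil.disk_usage(dest_dir)
    total, used, _free = usage
    needed = total_size(paths)
    # Decide on the group as a whole, never file by file.
    return used + needed <= max_fill * total
```

A store script would call can_transfer_group once per input file and start no copy at all when it returns False.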

5) runrecocert does not have a good algorithm for assigning disks
up front.  We need to go to a round robin with checking rather
than the current 'smallest disk at start of job' algorithm.
Mike is working on this.
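
For reference, a round robin with checking might be sketched in Python as below. This is only an illustration of the idea, not what runrecocert does or will do. The names make_disk_chooser and free_bytes are hypothetical, and how free space is actually measured on the farm disks is left to the caller.

```python
import itertools


def make_disk_chooser(disks, free_bytes):
    """Return a function that hands out disks in round-robin order.

    disks      -- list of disk names or mount points
    free_bytes -- callable returning the free space on a disk, in bytes
    """
    if not disks:
        raise ValueError("no disks configured")
    cycle = itertools.cycle(disks)

    def choose(needed_bytes):
        # Try each disk at most once per request, continuing from where
        # the previous request stopped, and skip disks without enough room.
        for _ in range(len(disks)):
            disk = next(cycle)
            if free_bytes(disk) >= needed_bytes:
                return disk
        return None  # nothing has room; the caller should wait and retry

    return choose
```

For real disks, free_bytes could be something like lambda d: shutil.disk_usage(d).free.  The point of the design is that the check happens at each assignment, so one disk that looked good at the start of the job does not collect everything.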

Bottom line:

We had a very nice test of high rates out of the farm.  We found
serious design problems in both our scripts and the data storage
configuration.  Some of the problems have been fixed but not all.