Computing and Software Status

October 2001

·Algorithms
 

RECO
L3

·FNAL FARMs

·Graphics

·Infrastructure

·Online

·Remote Production

·SAM

·Databases

·Simulation


Algorithms

RECO

     Harry Melanson

The status of RECO in the p10 production release is available at

http://www-d0.fnal.gov/computing/algorithms/status/p10.html

I have not yet asked the individual groups to send in their own reports.
 
 

L3

L3, as usual, slipped through the cracks: Dan asked for the status report late.

 
 





FNAL Farms

         Mike Diesburg

Farm Status, October 2001

Current Production Executables:

We have been running p10.04.00 for production.  Version
p10.07.01 has just been moved to the farm, and testing is in progress.

Farm Configuration:

Batch queues on the farm have been reconfigured to better match
real-life running conditions.  We also needed to reset the parameters
on the queues to prevent an interference condition in which jobs on
the gtr queue could block production and vice versa.  We now have
the following queues defined on the farm:

Production Queues:

production_fast  run only on 750 MHz nodes at high priority
production_hi    run on any production node at high priority
production_lo    run on any production node at low priority
production_slow  run on 500 MHz nodes at low priority

I/O Queues: (on d0bbin)

start
end
job_control
 

Special Purpose Queues:

d0fgtr      global tracking
test        small test queue, 10 nodes
big_test    large test queue, includes all nodes

We believe we have decoupled the interactions between
the gtr queues and the production queues.
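
As a rough illustration of the production queue rules listed above, the
node eligibility can be written down as a small lookup table. This is only
a sketch and not the actual batch system configuration: the names
PRODUCTION_QUEUES and eligible_queues are hypothetical, and it assumes
production_slow is restricted to the 500 MHz nodes.

```python
# Hypothetical model of the production queue rules in this report.
# Queue names are from the report; the data layout is an assumption.
PRODUCTION_QUEUES = {
    "production_fast": {"node_mhz": {750}, "priority": "high"},
    "production_hi": {"node_mhz": None, "priority": "high"},  # None = any production node
    "production_lo": {"node_mhz": None, "priority": "low"},
    "production_slow": {"node_mhz": {500}, "priority": "low"},  # assumed 500 MHz only
}


def eligible_queues(node_mhz):
    """Return the production queues whose jobs may run on a node of this clock speed."""
    return sorted(
        name
        for name, rule in PRODUCTION_QUEUES.items()
        if rule["node_mhz"] is None or node_mhz in rule["node_mhz"]
    )
```

For example, eligible_queues(750) includes production_fast but not
production_slow, and eligible_queues(500) is the reverse.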
 

New Equipment:

Memory has arrived to upgrade the lower 40 nodes to 1GB.
Installation will start later this week (Wednesday, Oct 17th).
This will allow us to use both processors on the older nodes.

The first 16 of the new farm nodes were delivered last
week.  They have been set up in FCC and are being cabled.  The
remaining 16 nodes were delivered to FCC today (Monday, Oct 15th).
Steve Timm believes he will be able to tell within 2-3 days of
starting tests if the controller problem is still present.  It
will take ~1 week to be confident the problem is completely
gone or reduced to an acceptable level.
 

Next Farm Purchase:

Steve Timm and Stan Naymola are putting together a recommendation
of a node configuration that could be purchased quickly and which they
believe is supportable.  This is complicated by the fact that all
qualified vendors currently use the same controller chip, which has been
demonstrated to exhibit serious problems.  It is hoped the Promise
controller used in the current D0 purchase will correct this, but that
remains to be demonstrated (see note above).
More power taps have to be installed in FCC2 before any nodes
beyond the current 32 additions could be used.  Planning for this has
begun, but an exact timeline for installation is not yet available.
 
 


Graphics

 


Infrastructure

 During the last month we have continued to make regular "test" releases, one per
week with both debug and maxopt versions of each on Linux and IRIX.
                frozen
  t01.59.00     abandoned
  t01.60.00     Sep 17
  t01.61.00     Sep 24
  t01.62.00     Oct  1
  t01.63.00     Oct  8

Production Pass Releases:
                frozen
  p10.02.00     abandoned
  p10.03.00     Sep 14
  p10.04.00     Sep 21  *** on the reco farms, real data
  p10.05.00     Sep 27
  p10.06.00     Oct  5
  p10.06.01     Oct 11 (data file change only)
  p10.07.00     Oct 12 (hopefully)
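
When comparing release tags such as those in the two lists above, it can
help to parse them into numbers instead of comparing strings by eye. The
following is a minimal sketch: it assumes tags always have the form of one
letter followed by three two-digit fields, and that a later release within
one series has numerically larger fields. The function names are
hypothetical and not part of any D0 tool.

```python
import re

# One lowercase letter (e.g. "t" or "p") followed by three two-digit fields.
TAG_RE = re.compile(r"^([a-z])(\d{2})\.(\d{2})\.(\d{2})$")


def parse_release(tag):
    """Split a tag like 'p10.06.01' into ('p', 10, 6, 1)."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError("unrecognized release tag: %r" % tag)
    series, first, second, third = match.groups()
    return series, int(first), int(second), int(third)


def newest(tags):
    """Return the highest-numbered tag; intended for tags of one series."""
    return max(tags, key=parse_release)
```

For example, newest(["p10.04.00", "p10.06.01", "p10.06.00"]) picks
p10.06.01.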

NT:  none

OSF: none

Linux, RH7.1 (for ClueD0 mostly)
  We have managed to build a couple of releases, t01.61.00 and p10.06.00 on
d0lxbld9, RH7.1, local disks and throttled way back. It takes 1.5 days/build. So
we've only done the debug ones. But they are available on ClueD0.

  Dugan had built and installed the very latest Linux kernel onto flashflood
(ClueD0, RH7.1, dual 1GHz). Last night (10/10) we managed to get a full build
done on it reading and writing to the /d0dist/dist/ disk nfs served from d02ka.
This is the first time we've managed to do a full build without hanging the
machine. There were 286 "broken" packages, but most of those were due to missing
or incorrect external packages. A number were also due to known
incompatibilities in our code for which we have patches but hadn't bothered to
put into the ClueD0 build. We are now correcting those problems and will repeat
the exercise. Since we should be building more packages, we may hang again.
 

Build Resources:
  Memory
    I'd heard that we have more memory for all the machines but it hasn't made
it into them yet. Why? I'd give the OK to do it at will. bld4 has 4GB (2 usable
due to RH 6.1), bld9 has 1GB, and the rest have .5GB. Until that memory is
installed, the .5GB machines are only of marginal use. We can't build on them.

  Disks
    We have asked for 140GB of additional disk space on d0mino but haven't
received it yet. That'll take us to 200GB on d0lxbld4, 300GB on d0mino. That
should keep us happy for a week or so.
 


Online:

 Major activities of the Online group, mostly in support of
 shutdown activities (or trying not to get in the way of shutdown
 activities):
 
 - Hardware:
     - Used a visit by the Compaq service engineer to install
       upgraded "I/O Riser Modules" in d0olc.  Supposed to
       give better network and disk I/O performance.  Haven't done
       any conclusive tests yet to see if there is any improvement.
     - Installing new nodes:
         - Win2000 Terminal Server, to be used for external
           access to any Windows node
         - DELL 4400 server, to be used as home of luminosity
           monitoring applications
         - 2 Linux "gateway" nodes, dual-homed between Online and
           Beams network which will be used for ACNET communication
         - Still awaiting (supposedly shipped) 8 Linux Networx nodes
           to satisfy DAQ and EXAMINE needs
     - Installed DLT4000 backup tape drives on d0ola/b/c
     - Extended Offline network to MCH and Collision Hall to allow
       easier use of laptops
 
 - Controls: 
     - Swapping in improved 1553 controller modules, hoping to
       address errors seen on long cables to platform
     - Steadily configuring more front ends to generate alarms,
       now need detector groups to set sensible limits
     - Loading front ends from ORACLE database derived EPICS
       configuration
     - Added verify option to COMICS downloads
 
 - DAQ
     - Will be testing "alarm enabled" DAQ processes on Thursday
     - Working on proper method for making direct copies of data
       from Online to d0mino disks
     - Various L3 tests and upgrades, aiming for multiple Segment
       Bridges and higher rates
     - Burning in 48-node L3 Linux farm
     - Significant progress in coupling L3 Linux farm to DAQ chain,
       expecting to come out of shutdown running filters in Linux
       nodes
 
 - Computer Security
     - Working on isolation tests, Kerberos configuration, and ACLs
 
 - Run 2B
     - Produced segment of Run 2B TDR on Online plans
 
 - Continuing issues
     - Still need unified EXAMINE leadership

Remote Farms
 mcp10 has been released to the farms and overlap events are being
generated. mcp10 currently uses p10.06.01 for all phases of
reconstruction. We will be upgrading to d0reco p10.07.00 before processing
any requests.

 mcp10 should be the last major release to the farms before Moriond. We
plan on upgrading the version of d0reco on a regular basis to the same
version used on data.

 Software Problems:

        A fix for single particle production is being implemented.

        Progress is being made on processing files placed in SAM by the
        physics groups.

        Progress is being made on running on data (A. Kupco)

        The new Metadata and Request system is making good progress.


SAM Data Handling System

 http://www-d0.fnal.gov/~lueking/sam/cpb_report_20011015.ppt
 
 

Databases

Simulation

Simulation Status (Oct. 2001)
============================

In general, we met the Oct. 15 deadline for p11.
That means all major features for p11 of the
generators/d0gstar/d0sim/d0raw2sim/d0trigsim programs
are already in cvs. After this week's test release,
we will start to test and fix bugs.

Generators:
-----------
 Ia Iashvili has added code to MCpythia to better
 handle parton information, and MCSingle.x can now
 be run in conjunction with MCisagen.x to handle
 single particle generation with decays (like tau's
 or B's).

D0gstar:
--------
 No change in d0gstar from the p10 release, except
 that a printHit method was added for event dumps in
 every SimXXXHitChunk.

 d0gstar simulates hits for all detectors in Dzero.
 Both calorimeter plate level geometry and mixture
 geometry are supported. It uses the new 3D field
 map by default. There are no known bugs.

D0Raw2Sim:
----------
 In order to get a more realistic simulation for MC
 events, we designed the D0Raw2Sim program, which
 converts the real raw data into a pseudo-SimChunk
 for every subdetector, so that D0Sim can merge
 real zero-bias data with the MC events.

 D0Raw2Sim is planned for the p11 production release.
 It has been in cvs for a while. D0Raw2Sim has code
 for all subdetector packages and runs without crashing.
 A problem was introduced with smt data trying to
 access the database. This is fixed in cvs now.
 D0Raw2Sim has been tested with MC events only.
 It runs without crashing, but when D0Sim tries
 to use the output it crashes in smt, cft and fps.
 We could not test it on real zero-bias data,
 because the only zero-bias runs we have so far
 have problems with smt data. We need a good
 zero-bias run to test D0Raw2Sim.

D0Sim:
------
 D0Sim runs fine with plate and mixture MC. A lot of
 effort has been put into understanding the differences
 between plate and mixture MC events. So far, all
 differences are as we expected. The plate MC in p10 can
 be officially used by the remote farms for the physics
 groups' needs.
 Packages have code to handle merging with real zero-bias
 data, but we need good zero-bias runs to test it.

PMCS:
-----
 PMCS is suffering from a severe lack of manpower.
 Despite this, Elemer and Zhong Min have managed to
 add new tracking code, many bugs have been fixed,
 and chunk output now works for all particles.
 Several things are still needed:
   1) good jet smearing
   2) good met smearing
   3) lots of plots verifying performance
   4) good way to easily do pileup
   5) interface to old CMS generator
 Sarah has 3 new students, and maybe they can get this done?

TrigSim:
--------
 Production releases from p10.02.00 onward are certified (run without
crashes). Some files are known to produce floating point exceptions, which
appear in L3 tracking; this is being investigated. The error rate among
tested files is less than 5% (i.e. fewer than 5 files in 100 have at least
one event that trigsim doesn't like).
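
The error rate quoted above is counted per file, not per event: a file is
"bad" if at least one of its events fails. A minimal sketch of that
bookkeeping follows. The function name and the input format (a list with
the number of failing events for each tested file) are assumptions made
for illustration, not part of the trigsim certification tools.

```python
def file_error_rate(failed_events_per_file):
    """Fraction of tested files that have at least one failing event.

    failed_events_per_file: one integer per tested file giving the number
    of events in that file that failed (hypothetical input format).
    """
    if not failed_events_per_file:
        raise ValueError("no files were tested")
    bad_files = sum(1 for failed in failed_events_per_file if failed > 0)
    return bad_files / len(failed_events_per_file)
```

With made-up numbers, 100 tested files of which 4 contain one or more
failing events give a rate of 0.04, no matter how many events failed
inside each of those 4 files.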

 Recent system-wide efforts have concentrated on packing/unpacking real
data (L1), configuration of all three trigger levels via coor-generated
config files (coor inputs from triggerDB), integration of the user
interface into the d0tools package and integration with the new runtime
environment being developed for D0 executables. Of course many individual
processors, tools and filters have been upgraded in the last month as well
but this is too much to detail.