·SAM
http://www-d0.fnal.gov/computing/algorithms/status/p10.html
I have not yet asked the individual groups to send in their own reports.
Current Production Executables:
We have been running p10.04.00 for production. Version p10.07.01 has
just been moved to the farm; testing is in progress.
Farm Configuration:
Batch queues on the farm have been reconfigured to better match
real-life running conditions. We also needed to reset the parameters
on the queues to prevent an interference condition in which jobs on
the gtr queue could block production and vice versa. We now have
the following queues defined on the farm:
Production Queues:
production_fast   run only on 750 MHz nodes at high priority
production_hi     run on any production node at high priority
production_lo     run on any production node at low priority
production_slow   run on 500 MHz nodes at low priority
I/O Queues: (on d0bbin)
start
end
job_control
Special Purpose Queues:
d0fgtr     global tracking
test       small test queue, 10 nodes
big_test   large test queue, includes all nodes
We believe we have decoupled the interactions between
the gtr queues and the production queues.
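
The production queue layout above can be sketched as a small lookup
table. This is only an illustration, not the farm's actual batch
configuration: the Queue class, the field names and the
queues_for_node helper are hypothetical, and it assumes that
"any production node" means both the 500 MHz and 750 MHz nodes and
that production_slow is restricted to 500 MHz nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Queue:
    name: str
    priority: str            # "high" or "low", as listed in the report
    node_mhz: Optional[int]  # None = any production node (assumption)


# Values copied from the queue list in the report.
PRODUCTION_QUEUES = [
    Queue("production_fast", "high", 750),
    Queue("production_hi", "high", None),
    Queue("production_lo", "low", None),
    Queue("production_slow", "low", 500),
]


def queues_for_node(node_mhz: int) -> List[str]:
    """Return the production queues whose jobs may run on a node of this speed."""
    return [q.name for q in PRODUCTION_QUEUES if q.node_mhz in (None, node_mhz)]


if __name__ == "__main__":
    for mhz in (500, 750):
        print(mhz, queues_for_node(mhz))
```

Under these assumptions a 750 MHz node serves production_fast,
production_hi and production_lo, while a 500 MHz node serves
production_hi, production_lo and production_slow.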
New Equipment:
Memory has arrived to upgrade the lower 40 nodes to 1GB.
Installation will start later this week (Wednesday, Oct 17th).
This will allow us to use both processors on the older nodes.
The first 16 of the new farm nodes were delivered last
week. They have been set up in FCC and are being cabled. The
remaining 16 nodes were delivered to FCC today (Monday, Oct 15th).
Steve Timm believes he will be able to tell within 2-3 days of
starting tests if the controller problem is still present. It
will take ~1 week to be confident the problem is completely
gone or reduced to an acceptable level.
Next Farm Purchase:
Steve Timm and Stan Naymola are putting together a recommendation
of a node configuration that could be purchased quickly and which they
believe is supportable. This is complicated by the fact that all
qualified vendors currently use the same controller chip, which has
been demonstrated to exhibit serious problems. It is hoped that the
Promise controller used in the current D0 purchase will correct this,
but that remains to be demonstrated (see note above).
More power taps have to be installed in FCC2 before any nodes
beyond the current 32 additions can be used. Planning for this has
begun, but an exact timeline for installation is not yet available.
Production Pass Releases:
release     frozen
p10.02.00   abandoned
p10.03.00   Sep 14
p10.04.00   Sep 21   *** on the reco farms, real data
p10.05.00   Sep 27
p10.06.00   Oct 5
p10.06.01   Oct 11   (data file change only)
p10.07.00   Oct 12   (hopefully)
NT: none
OSF: none
Linux, RH7.1 (for ClueD0 mostly)
We have managed to build a couple of releases, t01.61.00 and
p10.06.00, on d0lxbld9 (RH7.1, local disks, throttled way back).
It takes 1.5 days per build, so we've only done the debug ones,
but they are available on ClueD0.
Dugan had built and installed the very latest Linux kernel onto
flashflood (ClueD0, RH7.1, dual 1GHz). Last night (10/10) we managed
to get a full build done on it, reading and writing to the
/d0dist/dist/ disk NFS-served from d02ka. This is the first time we've
managed to do a full build without hanging the machine. There were 286
"broken" packages, but most of those were due to missing or incorrect
external packages. A number were also due to known incompatibilities
in our code, for which we have patches that we hadn't bothered to put
into the ClueD0 build. We are now correcting those problems and will
repeat the exercise. Since we should be building more packages, we may
hang again.
Build Resources:
Memory
I'd heard that we have more memory for all the machines, but it
hasn't made it into them yet. Why? I'd give the OK to do it at will.
bld4 had 4GB (2 usable due to RH 6.1), bld9 has 1GB, the rest have
0.5GB. Until that memory is installed, the 0.5GB machines are only of
marginal use. We can't build on them.
Disks
We have asked for 140GB of additional disk space on d0mino but
haven't received it yet. That'll take us to 200GB on d0lxbld4 and
300GB on d0mino. That should keep us happy for a week or so.
mcp10 should be the last major release to the farms before Moriond.
We plan on upgrading the version of d0reco on a regular basis to the
same version used on data.
Software Problems:
A fix for single particle production is being implemented.
Progress is being made on processing files placed in SAM by the
physics groups.
Progress is being made on running on data (A. Kupco).
The new Metadata and Request system is making good progress.
In general, we met the Oct. 15 deadline for p11.
That means all major features for p11 of the
generators/d0gstar/d0sim/d0raw2sim/d0trigsim programs
are in cvs already. After this week's test release,
we will start to test and fix bugs.
Generators:
-----------
Ia Iashvili has added code to MCpythia to better
handle parton information, and MCSingle.x can now
be run in conjunction with MCisagen.x to handle
single-particle generation with decays (like taus
or Bs).
D0gstar:
--------
No change in d0gstar from the p10 release, except that
a printHit method was added for event dump in
every SimXXXHitChunk.
d0gstar simulates hits for all detectors in Dzero.
Both calorimeter plate-level geometry and mixture
geometry are supported. It uses the new 3D field
map by default. There are no known bugs.
D0Raw2Sim:
----------
In order to get a more realistic simulation for MC
events, we designed the D0Raw2Sim program, which
converts the real raw data into a pseudo-SimChunk
for every subdetector, so that D0Sim can merge
real zero-bias data with the MC events.
D0Raw2Sim is planned for p11 production release.
It has been in cvs for a while. D0Raw2Sim has code
for all subdetector packages and runs without crashing.
A problem was introduced with smt data trying to
access the database. This is fixed in cvs now.
D0Raw2Sim has been tested with MC events only.
It runs without crashing but when D0Sim tries
to use the output it crashes in smt, cft and fps.
We could not test it on real zero-bias data,
because the only zero-bias runs we have show
problems with smt data. We need a good
zero-bias run to test D0Raw2Sim.
D0Sim:
------
D0Sim runs fine with plate and mixture MC. A lot of
effort has been put into understanding the differences
between plate and mixture MC events. So far, all
differences are as we expected. The plate MC in p10 can
be officially used by the remote farms for the physics
groups' needs.
Packages have code to handle merging with real zero
bias data, but we need good zero-bias runs to test it.
PMCS:
-----
PMCS is suffering from a severe lack of manpower.
Despite this, Elemer and Zhong Min have
managed to add new tracking code, many bugs have
been fixed, and chunk output now works for all particles.
Several things are still needed:
1) good jet smearing
2) good met smearing
3) lots of plots verifying performance
4) good way to easily do pileup
5) interface to old CMS generator
Sarah has 3 new students, and maybe they can get this done?
TrigSim:
--------
Production releases from p10.02.00 onward are certified (run without
crashes). Some files are known to produce floating point exceptions,
which appear in L3 tracking; this is being investigated. The error
rate among tested files is less than 5% (i.e., fewer than 5 files in
100 have at least one event that trigsim doesn't like).
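
The error rate quoted above is counted per file, not per event: a file
counts as bad if at least one of its events fails. A minimal sketch of
that bookkeeping follows. The function name, the input layout (file
name mapped to a list of per-event pass/fail flags) and the sample
file names are all hypothetical; this is not taken from the trigsim
certification scripts.

```python
from typing import Dict, List


def file_failure_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of files with at least one failing event.

    results maps a file name to per-event flags (True = event ran cleanly).
    """
    if not results:
        raise ValueError("no files tested")
    bad_files = sum(1 for events in results.values() if not all(events))
    return bad_files / len(results)


if __name__ == "__main__":
    # Made-up sample: one of four files has a single failing event.
    sample = {
        "file_a": [True, True, True],
        "file_b": [True, False, True],
        "file_c": [True, True],
        "file_d": [True],
    }
    print(f"file-level failure rate: {file_failure_rate(sample):.0%}")
```

With this definition a single bad event marks the whole file as
failing, so the per-event failure rate is lower than the per-file
figure.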
Recent system-wide efforts have concentrated on packing/unpacking real
data (L1), configuration of all three trigger levels via coor-generated
config files (coor inputs from triggerDB), integration of the user
interface into the d0tools package, and integration with the new runtime
environment being developed for D0 executables. Of course, many
individual processors, tools and filters have been upgraded in the last
month as well, but this is too much to detail.