http://www-d0.fnal.gov/computing/algorithms/status/p10.html
I have not yet asked the individual groups to send in their own reports.
Running Problems:
Farm systems have been experiencing significant degradation of
throughput for a couple of weeks due to various problems associated
with storing the output files into SAM. The root cause of the problems
seems to be unreliability of the Mammoth hardware. This has led to
backups in output queues, filled output buffers, and lost store
requests.
To help alleviate this we have widened the farm output file family to 4
and installed more robust versions of encp.
We are still experiencing occasional storage failures with a "Broken
pipe" error. This is a network error of unknown origin. The error rate
for this is at the few percent level.
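One common way to ride out transient errors of this kind is to retry the store with a growing delay. The sketch below is illustrative only: `store` is a hypothetical callable standing in for whatever performs the transfer (it is not the encp or SAM interface), and the retry count and delays are assumed values. The helper `residual_failure_rate` assumes that failures on successive attempts are independent, which has not been established for this error.

```python
import time


def store_with_retry(store, path, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call store(path), retrying on BrokenPipeError with exponential backoff.

    `store` is a hypothetical callable standing in for the real transfer
    step; it is not the encp or SAM interface.
    """
    for attempt in range(1, attempts + 1):
        try:
            return store(path)
        except BrokenPipeError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def residual_failure_rate(per_attempt_rate, attempts):
    """Chance that every attempt fails, ASSUMING independent failures."""
    return per_attempt_rate ** attempts
```

Under the independence assumption, a per-attempt failure rate of 3% with three attempts would leave roughly 0.003% of stores failing. If the errors are correlated (for example, during a network outage), retries help much less.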
Production Version Status:
Farms are still running t01.56.00 as the standard production
executable. The version p10.04.00 recocert sample has been run through
the farms.
Many of the failed reco_analyze jobs have been tracked down to what
appears to be a crash in the exit code of reco_analyze. Harry Melanson
is actively exploring ways to debug this and the possibility of a
temporary patch that would allow the root files to be used.
Shift Training Status:
An initial training session was held during the collaboration meeting
where basic farm operations were covered as well as SAM shift
operations.
We expect to start one or two farm shifters within the next two weeks
doing basic job submission chores and forwarding problems to resident
experts.
New Farm Nodes:
Current delivery date for the 32 new farm nodes is Oct 2nd. One
evaluation sample of the nodes has been delivered to the farm group
prior to shipment for verification.
The FSU group continues to work on prototype 2-D classes for D0Scan.
Online D0ve crashes on SMT unpacking. Experts are being consulted.
Recent changes to muon segments cannot be retrofitted to online T1.51
and await installation of version T1.58 or higher.
Production Pass Releases:
frozen
p09.07.00 Aug 17
p08.13.00 Aug 21
p09.08.00 Aug 24
p10.00.00 Aug 24 (identical to t01.56.00)
p09.09.00 Aug 27
p09.10.00 Sep 4
p10.01.00 Sep 6
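The release tags above appear to share one shape: a lowercase prefix followed by three dot-separated numeric fields. The sketch below parses tags of that shape so they can be compared and sorted. The field names `major`, `minor` and `patch` are labels chosen here for illustration; the report does not say what the individual fields mean.

```python
import re
from typing import NamedTuple


class ReleaseTag(NamedTuple):
    prefix: str  # e.g. "p", "t", "pnt", "onl"
    major: int   # field names are illustrative, not taken from the report
    minor: int
    patch: int


_TAG = re.compile(r"^([a-z]+)(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag):
    """Split a tag such as 'p10.01.00' into its prefix and numeric fields."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized release tag: {tag!r}")
    prefix, major, minor, patch = match.groups()
    return ReleaseTag(prefix, int(major), int(minor), int(patch))


def latest(tags, prefix):
    """Return the highest-numbered tag that carries the given prefix."""
    candidates = [t for t in map(parse_tag, tags) if t.prefix == prefix]
    if not candidates:
        raise ValueError(f"no tags with prefix {prefix!r}")
    return max(candidates)  # tuples compare field by field


frozen = ["p09.07.00", "p08.13.00", "p09.08.00", "p10.00.00",
          "p09.09.00", "p09.10.00", "p10.01.00"]
newest = latest(frozen, "p")
```

Comparing the parsed numbers avoids the pitfalls of comparing tags as strings once the field widths differ.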
NT: This may be the final NT build until we do the wrap-up one to
archive the absolute final version. They are currently doing "local"
quick builds trying to get something that works, recognizing that this
data isn't and never will be publishable for Physics.
frozen
pnt09.08.00 Aug 23
OSF:
onl01.58.00 is being done now.
Linux, RH7.1
We have been attempting to begin RH7.1 builds as well. We have two
machines, d0lxbld9 and 10, running RH7.1 and mounting the /d0dist/ disk
r/w from d02ka. This is the disk that is exported to the clued0
cluster. Apparent nfs problems have prevented this from working. The
build machine has "hung", perhaps taking down d02ka, every time we've
tried it. We are working with the sys-admins trying to solve these
problems, but so far without much success. We have gotten one build
done, building on a clued0 machine. But even it hung on the first try.
NOTE: there still is no official FermiRedHat Linux 7.1 version
available. So a lot of this is "cutting edge".
Build Resources:
Memory
We have hit a "wall" on the Linux build machines. Executables have
gotten so large that if more than a couple were being linked on
d0lxbld4 (1GB of memory) at once, we began swapping, which slowed the
builds so much that they took a *very* long time and sometimes failed
entirely.
The memory on d0lxbld4 has since been doubled, but we still can't do
the 4-6 builds simultaneously that the release schedule requires. We
have been unable to use any of the other d0lxbld machines either. They
have less memory and fewer processors, but the big problem is that they
need to nfs mount the d0lxbld4 /d0dist/ disks. nfs on Linux is "buggy"
enough that we *always* get several IO errors during a build. Depending
on exactly when those errors occur, the entire build or only a portion
of it is junk. In the worst case, which is close to the usual one,
enough of the build is junk that the entire thing has to be restarted
from the beginning.
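A rough back-of-envelope estimate shows why doubling the memory may not be enough. The per-link footprint of 0.5GB used below is an assumption, inferred only from the statement that "a couple" of simultaneous links filled the original 1GB. It is not a measured number, and the estimate further assumes each build has at most one link running at a time.

```python
def max_concurrent_links(memory_gb, per_link_gb):
    """Largest number of simultaneous links that fits in physical memory."""
    if per_link_gb <= 0:
        raise ValueError("per_link_gb must be positive")
    return int(memory_gb // per_link_gb)


# ASSUMPTION: "a couple" of links filled the original 1 GB, so ~0.5 GB each.
PER_LINK_GB = 0.5

links_before = max_concurrent_links(1.0, PER_LINK_GB)  # original d0lxbld4
links_after = max_concurrent_links(2.0, PER_LINK_GB)   # after the doubling

# Memory needed if all 6 builds on the schedule link at the same moment.
needed_for_six_gb = 6 * PER_LINK_GB
```

Under these assumptions the doubled machine holds about four links without swapping, at the low end of the 4-6 simultaneous builds the schedule calls for, while six would need about 3GB.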
Disks
In the last month we have doubled the disk space available to our
builds on Linux (to 200GB). We now need to add a significant increment
on IRIX, which is already at 170GB. The builds on IRIX are 20-30%
bigger than on Linux. In addition it is the "master" distribution node,
so we *must* have all the builds there, including OSF, NT etc. These
aren't large, but they add up.
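The size of the IRIX increment can be estimated from the numbers above. The sketch assumes that IRIX must hold the same set of builds that the 200GB Linux area is sized for. That is an assumption, since the report does not say how full the Linux area is, and the OSF and NT builds are left out because no sizes are given for them.

```python
LINUX_GB = 200  # Linux build space after the doubling
IRIX_GB = 170   # current IRIX build space


def irix_equivalent_gb(linux_gb, percent_bigger):
    """Space the same builds would need on IRIX, given the size penalty."""
    return linux_gb * (100 + percent_bigger) / 100


low_gb = irix_equivalent_gb(LINUX_GB, 20)   # builds 20% bigger on IRIX
high_gb = irix_equivalent_gb(LINUX_GB, 30)  # builds 30% bigger on IRIX

# Extra space needed on IRIX, before counting the OSF and NT builds.
shortfall_low_gb = low_gb - IRIX_GB
shortfall_high_gb = high_gb - IRIX_GB
```

On these assumptions IRIX would need about 240-260GB, an increase of roughly 70-90GB over the current 170GB, with the OSF and NT builds on top of that.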
There are some problems with recoanalyze which will be addressed
with a new release.
Software Releases:
p09.10.00 - d0gstar is broken, leading to segmentation faults on Linux
machines. Until this is fixed we cannot run p09 on the farms. mc_runjob
has been configured to work with p09.10 and is ready to go. All other
pieces of p09 appear to work. The 500k trigger request cannot start
until this is complete.
Runtime Environment - Jonathan Hays has a proposal to rework the DZero
runtime environment to make it more portable. David Ritchie (Fermilab)
is going to help upgrade packages to support this new idea. This is
being done in conjunction with mc_runjob and d0tools.
Software development
GRID - On Wednesday 12 September we will have a working meeting to
examine interfacing grid tools with SAM.
mc_runjob - Need to integrate sam executables into mc_runjob.
Hardware activities:
- Racks and nodes for L3 Linux farm are here. There are some issues
  getting appropriate power to the racks, and there will likely be a
  kluge solution until electrical work can proceed during the October
  shutdown. Some parts (network patch panels, power distribution) are
  missing from Linux Networx, but we'll likely go ahead with
  installation.
- Some slight network upgrades in the Gb network handling VRC to L3
  farm traffic. There is now a Gb uplink to the 6509.
- 8 new Linux nodes for Examines, etc are > 3 weeks late.
Software activities:
- Lots of work on security and ACLs, slowly and painfully converging;
  will have shutdown activities on Kerberization and "disaster
  recovery"; making sure all applications are properly configured to
  work within ACL restrictions.
- Steady improvements to controls applications and diagnosis of sticky
  problems with 1553 (which may have hardware origins); pushing ahead
  on alarm system.
Software concerns:
- Have no one to guide and direct Examine efforts, nor to understand
  the low level issues. Need to use trigger selection capabilities.
- Graphics appears to have little direction and momentum.
Problems:
1. Stu has had to open up ACLs for the 131.225.222 subnet to allow
   packets from d0ora1 that are coming through the wrong interfaces to
   the online system. We are trying to find and fix this.
2. We have recently had problems with project masters not calling back
   consumer clients on central analysis. Igor looked at this yesterday;
   I am not sure what he found.
3. Although we planned and coordinated the python v2.1 upgrade, there
   were some problems which have been worked out. There still seem to
   be some problems with 2.1a.
4. The usual noaccess volume issues.
5. Some contention for tape drives among online, farms, and other
   offline users has been observed.
Plans
1. We hope to have the 3.2 release ready by the end of Sept.
2. Some work will be required to move to LTO's for storing MC data, and
   9940's for detector data. We hope to transition the MC asap,
   determined by when the tapes and robot bins arrive and are installed
   in the robot. The 9940s will certainly be ready for data taking
   after the shutdown.
3. Our next major release will be scheduled for sometime in mid
   November. The things being discussed for this release include
   a) transitioning to OmniORB to replace fnorb, the python module,
   b) new features to trace the life cycle of data set definitions and
      datasets for each group,
   c) a new command parser that has better help included.
4. Work on displays for SC2001 should provide some neat visual views of
   the operation of the system that will be fun to watch, and probably
   useful to detect problems.
D0gstar/D0sim:
Mainly working on p10 production release.
Changes in D0gstar for p09 and p10:
- All detectors, including LUMI and FPD.
- New 3D magnetic field map.
- New interface to access the field configuration.
- Default ECUTs for calorimeter plate level geometry were lowered to
  get more accurate response.
- A new option to swim very forward p/pbar to FPD in double precision
  was added. This option is OFF by default.
- New access methods on the option in event information.
- Changed from GEANT 3.21.12 to GEANT 3.21.13 (Geant bug fix).
Changes in D0sim:
in p09.10.00:
- Turned on smt noise simulation.
- Use cal only optimized weights in L1CalTTChunk.
- Handle both mixture and plate geometry.
in p10:
- CalDataChunk uses cal only optimized weights.
- Added code to handle pseudo SimChunks.
To test the p10 release, a set of MC samples is generated through
D0gstar and D0sim. The MC verification sample includes Z->ee, mumu,
tautau, ttbar and single electron. We have generated 20 files with
mixture or plate geometry, and with 2.5mb or 0mb separately.
CPU and output size studies are done on the p10 verification samples.
All these files are tested with mc_examine and with each subdetector
examine. The check on RawDataChunk looks ok.
We certified that p10 d0gstar/d0sim can be put on the MC farm.
A lot of effort has been put into the comparison of the mixture and
plate Monte Carlo.
PMCS:
The general framework of PMCS is in place and the current work focuses
on providing the smearing functions for the various physics objects.
The major effort now is to put PMCS output into the d0phyobj chunk and
to simulate the variables in the chunk properly.
D0TrigSim (by Oneil)
P10 contains a lot of big changes for d0trigsim infrastructure.
For example L1/L2 changes to run on real data:
- l1l2unpacker learns VRB format.
- more analyze packages read directly from RDC (l1cal, l1muon,
  l2caljet, l2calmet, l2calem, l2cps, l2gbl).
- tsim_l1ft completely rewrites L3 output classes to better emulate
  real data. New unpacker to be written for l1ft analyze package and
  examine.
More functionality added to l2gbl (track match), L3 filter/tools,
L2errorlogger integration, etc.