Computing and Software Status

November 2001

·Algorithms
 

RECO
L3

·FNAL FARMs

·Graphics

·Infrastructure

·Online

·Remote Analysis

·Remote Production

·SAM

·Databases

·Simulation

---

Algorithms

RECO

     Harry Melanson

The current status of reco is summarized at http://www-d0.fnal.gov/computing/algorithms/status/p10.html 

Some highlights:

1)      p10.10.00 exists, but does not include all changes that we have intended for processing post-November shutdown data.

2)      Known changes that are scheduled to be included in p10.11.00 (to be built on November 20) are:
a) implement agreed-upon official jet algorithms
b) include a significant number of "muon" fixes
c) include "special event stream" selection criteria (W, Z, ??? selection) (Some of these are being tested now, and seem to be OK so far...)

3)       It is not yet clear whether the current code can be expected to be able to handle new CFT electronics. This is under investigation, but could easily be "an issue". Regardless of the outcome, it is highly anticipated that we will need to make pass releases to p10 to respond to a "new" CFT detector. (This is a good thing!)

4)      The changes released with p10.10.00 continue to respond to "real needs". GTR tracks now include hit masks, as requested by the bc id group. The primary vertex cuts have been tuned a little to deal with current real data track error matrices. The cellNN algorithm has been modified to improve its speed. The MET calculation includes lowered tower thresholds.

5)      In addition, the default mode of running reco now includes NADA in killing mode for both data and MC. For the record, this change is still awaiting reports from physics groups to confirm it as "reasonable" for now (maybe to be reported at the Nov. 16 ADM). It should also be continually reviewed. This might require some future changes in farm production, where we run a standard reco on all events, and a "monitor" reco on a fraction of events.

6)      The alignment group may provide in the next few weeks a first pass new alignment (that would imply a new p10 pass release). This would include
a) SMT north vs SMT south alignment,
b) ladder by ladder alignment within individual SMT barrels,
c) SMT relative barrel alignment,
d) CFT ribbon alignment (for limited detector),
e) SMT vs CFT alignment.

7)      We will eventually need to make a connection between reco and the database server(s). This currently includes magnet configuration and SMT calibration tables. There is currently no schedule for this migration, although this functionality has been fully tested. The only reason for not attempting this deployment has been priority issues with other changes.

L3

Nov 2001 Level3 Monthly Report
==================================

Bitwise Chunk Comparator...........................................Han Do
The l3chunkcomp package successfully selects the online and offline
L3chunks and compares first the length and then the contents of the two.
It returns a bool reflecting whether the 2 chunks are equal.  This is a
first step needed for data (and tsim_l3) verification: does L3 produce
identical results when run on the same data under the same trigger
conditions online and offline?
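
The comparison described above (length first, then contents) can be
sketched as follows. This is only an illustrative Python sketch, not the
l3chunkcomp code: the function name chunks_equal and the representation
of a chunk as a bytes object are assumptions made for the example.

```python
def chunks_equal(online: bytes, offline: bytes) -> bool:
    """Return True if two serialized chunks are bitwise identical.

    Hypothetical helper: each chunk is assumed to be available as a
    bytes object; the real package operates on L3 chunks.
    """
    # Cheap check first: chunks of different length cannot be equal.
    if len(online) != len(offline):
        return False
    # Only then compare the contents byte for byte.
    return online == offline


if __name__ == "__main__":
    same = chunks_equal(b"\x01\x02\x03", b"\x01\x02\x03")
    shorter = chunks_equal(b"\x01\x02\x03", b"\x01\x02")
    print("identical chunks equal:", same)
    print("different-length chunks equal:", shorter)
```

Checking the length first gives a quick rejection for most mismatches
before the full content comparison is made.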

ScriptRunner................................................Moacyr Souza
Since triggerlists often are prepared with a specific release in mind (new
tools or filters, or new functionality of old) we discussed the need to
add a cross check (at triggerlist initialization) that the correct exe is
downloaded.  This would mean finding a means of having ScriptRunner
know its release number (such a check may have to be short-circuited
offline under trigsim).  No simple mechanism for this has been identified.
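
A minimal sketch of what such a cross check might look like is given
below. All names (check_release, required_release, running_release,
offline) are hypothetical. Since no mechanism for ScriptRunner to learn
its own release number has been identified, the running release is simply
passed in as an argument here.

```python
def check_release(required_release: str, running_release: str,
                  offline: bool = False) -> bool:
    """Return True if the triggerlist may be initialized with this exe.

    required_release: release the triggerlist was prepared for (assumed
                      to be recorded with the triggerlist).
    running_release:  release of the downloaded exe (how ScriptRunner
                      would obtain this is exactly the open question).
    offline:          short-circuit the check, e.g. under trigsim.
    """
    if offline:
        # Offline running is allowed to skip the check entirely.
        return True
    return required_release == running_release


if __name__ == "__main__":
    print(check_release("p10.09.00", "p10.09.00"))
    print(check_release("p10.09.00", "p10.08.01"))
    print(check_release("p10.09.00", "p10.08.01", offline=True))
```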

Manpower Additions........................................................
Terry Wyatt [Manchester] has agreed to assume the role of L3 co-leader
when Moacyr steps down at the end of the year.

Tool Reports
------------
Jet......................................................Volker Beuscher
Rejection estimates have been made from the Mark&Pass runs
JT_HI [CJT(1,10)JET_15] pass rate:  5.4%  -> Rejection=18.5
JT_LO [CJT(1,5) JET_10] pass rate: 11.4%  -> Rejection= 8.8

L3Ele..........................................................Ia Iashvili
Efficiency plots from the Oct 6-7 Mark&Pass runs of jet filters with
emfrac requirements show nice turn-ons at the expected L3 thresholds.
Efficiencies appear to level off at ~80%, until offline good em
requirements are imposed, when the efficiency becomes ~100%. Rejection
rates (after L1) are estimated at
EM_HI [CEM(1,10)JET_EM15] pass rate:18.2% -> Rejection= 5.5
EM_LO [CEM(1,5) JET_EM10] pass rate: 6.7% -> Rejection=15.0
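
The rejection figures quoted in the Jet and L3Ele reports are consistent
with rejection = 1 / pass rate. The short Python check below recomputes
them from the quoted (rounded) pass rates; the function name is made up
for the example. For EM_LO the rounded pass rate gives a value slightly
below the quoted 15.0, presumably because the quoted figure was computed
from the unrounded pass rate.

```python
def rejection(pass_rate_percent: float) -> float:
    """Rejection factor corresponding to a pass rate given in percent."""
    return 100.0 / pass_rate_percent


# Pass rates (in percent) as quoted in the tool reports.
pass_rates = {"JT_HI": 5.4, "JT_LO": 11.4, "EM_HI": 18.2, "EM_LO": 6.7}

if __name__ == "__main__":
    for name, rate in pass_rates.items():
        print(f"{name}: pass rate {rate:4.1f}% -> rejection {rejection(rate):.1f}")
```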

CFT Tracking................................................Ray Beuselink
The existing state of the detector has pointed up a deficiency in the
implementation of the link and tree algorithm. Although the algorithm
spans the missing layers 7 and 11, no hits are picked up from layer 3
(since layer 1 has no data).  Starting elementary trees in layer 5 when
there are no hits in layer 1 will require some code redesign.


SMT Unpacking/clustering................Daniela Bauer, Robert Illingworth
The tool has been written to allow parameterised pedestal values
(approximated across a chip by a fourth order polynomial). Daniela has
tried calculating the parameterisations (working for a subset of the
chips), but database access problems persist.  For the time being we use
the pedestal values generated a couple of months ago by the offline group.
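
As an illustration of the parameterisation, the sketch below evaluates a
fourth order polynomial at each strip of a chip using Horner's rule. It
is not the tool's code: the function name, the coefficient values and the
number of strips per chip (128) are assumptions chosen for the example.

```python
from typing import List, Sequence

N_STRIPS = 128  # assumed number of channels per chip


def pedestal(strip: int, coeffs: Sequence[float]) -> float:
    """Evaluate c0 + c1*x + c2*x**2 + c3*x**3 + c4*x**4 at x = strip.

    coeffs is ordered [c0, c1, c2, c3, c4]; Horner's rule avoids
    computing explicit powers.
    """
    value = 0.0
    for c in reversed(coeffs):
        value = value * strip + c
    return value


def chip_pedestals(coeffs: Sequence[float]) -> List[float]:
    """Pedestal for every strip on one chip."""
    return [pedestal(s, coeffs) for s in range(N_STRIPS)]


if __name__ == "__main__":
    # Made-up coefficients, for illustration only.
    example = [40.0, 0.05, -1.0e-4, 0.0, 0.0]
    values = chip_pedestals(example)
    print("first strip:", values[0], "last strip:", values[-1])
```

Storing five coefficients per chip instead of one value per strip keeps
the amount of data to be downloaded small.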

Still to do: put in the adjustment for charge drift in the magnetic field,
(which is a function of solenoid polarity), and add a noisy strip-killing
routine.

Global/SMT Tracking.......................................Daniel Whiteson
Have looked at real data with SMT-only tracking.  The distribution of
tracks vs phi is not flat, and is clearly related to a sinusoidal
dependence of DCA with phi.  This is seen offline as well, attributed to
the 400 micron beamspot spread and its off-center location within the
detector. Work has begun on a filter requiring multiple tracks emerging
from a common vertex, rather than merely requiring a high-pt track
(rejection is only modest due to the high level of SMT noise).
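
The sinusoidal dependence follows from simple geometry: a track that
originates from a beam position (x0, y0) and travels at azimuth phi has a
signed distance of closest approach to the detector origin of
x0*sin(phi) - y0*cos(phi). The sketch below illustrates this; the sign
convention and the offset values are assumptions, not measured numbers.

```python
import math


def dca_to_origin(phi: float, x0: float, y0: float) -> float:
    """Signed DCA to (0, 0) of a straight track from (x0, y0) at angle phi.

    This is the z-component of the cross product of the beam position
    with the track direction (cos(phi), sin(phi)). The overall sign is a
    convention (assumed here).
    """
    return x0 * math.sin(phi) - y0 * math.cos(phi)


if __name__ == "__main__":
    # Hypothetical beam offset in microns, for illustration only.
    x0, y0 = 300.0, -400.0
    for i in range(8):
        phi = i * math.pi / 4.0
        print(f"phi = {phi:5.2f} rad  DCA = {dca_to_origin(phi, x0, y0):8.1f} micron")
```

The amplitude of the sinusoid is the radial offset sqrt(x0**2 + y0**2),
so a fit of DCA versus phi gives the beam position directly.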

Muon............................................................Paul Balm
The November goal is to filter on local muon tracks, with expansion into a
global muon tool to be staged.  To run in November at all, the unpacker
must integrate Scott Snyder's fix to make it independent of the
still-rapidly changing configuration files.  Martin Wegner [Aachen] has
begun that verification.  Frederic Deliot [Saclay] is investigating the
memory leak.

L3Vertex...................................................Guilherme Lima
Has yet to be run on real data.  Its certification on MC data awaits some
coding.  Since its port to NT, all performance analysis tools (with their
dependencies on unported code) have been turned off.  They have been
replaced with (untested) l3fanalyze code.

L3Propagator..............................................Arnaud Duperrin
As requested, have compared Geant hit coordinates to those extrapolated
by l3fpropagator, as well as l3 to offline.
From CFT to CPS: x,y,z resolutions are ~100-200 micron
From CFT to Muon: ~3-4 cm
(which are similar to resolutions for the Offline propagator).
Differences between l3/offline propagators are less than 30 microns.


 
 
 
 
 
 

---

FNAL Farms

         Mike Diesburg

---

Graphics

 

---

Infrastructure

During the last month we have continued to make regular "test" releases, one per
week with both debug and maxopt versions of each on Linux and IRIX.
                frozen
  t01.64.00     Oct 16
  t01.65.00     Oct 23
  t01.66.00     Oct 31
  t01.67.00     Nov  4
  t01.68.00     Nov 11 *But* it's a *really* bad one.

Production Pass Releases:
                frozen
  p10.07.00     Oct 12
  p10.07.01     Oct 16
  p10.08.00     abandoned
  p10.08.01     Oct 31
  p10.09.00     Nov 11

NT:  none

OSF: onl01.63.00

Linux, RH7.1 (for ClueD0 mostly)
    We are still having tremendous difficulty doing builds on RH7.1. The
symptoms now are that when we are doing builds, the disk cache grows until it
fills all of memory, then the programs start to swap. Once this happens it's
only a matter of time until the machine effectively hangs. *IF* you are logged
into the machine and are running the builds from a live terminal window (*not*
in the background) you can ^C the process and the machine will eventually
recover (20 minutes+). The behavior is not reproducible in detail. On our new
build machine (8 cpus, 8GB of memory) we have managed to get an entire build
done using 8x4 parallelism (*really* beating it up), but have hung doing a 2x1.
At the present time, it appears that we can do up to 4x4 through the lib phase,
but as soon as we start doing any linking, we have to dial back to 1x1. NOTE:
this means that we can't start debug and maxopt or do any p releases at the same
time as t releases. We are effectively limited to *1* of the *8* cpus on the new
machine.
    Paul is trying to understand what's going on. But it's slow.
    The upshot is that we are severely limited in the 7.1 releases that we can
supply.

Build Resources:
  Build Machines
    Domino is becoming a real problem for us. It is extremely slow. This is a
problem for everyone, not just us.
    RH6.1 d0lxbld4 is fine. Builds, if we don't do too many at once, are done in
<12 hours. It's only when we try to do 2 "t" builds + 2 "p" builds that we get
into trouble. We rarely have the resources to do special builds without
impacting the normal build.
    RH7.1 d0lomite has lots of resources. We just can't use them :-( See above.

  Memory
    Looks fine now that all the Linux boxes have had new memory installed.

  Disks
    We have received all the requested disks. That should hold us for a month or
so :-(
    We now have:
       d0mino    278GB + 36GB for tarfiles
       d0lxbld4  211GB RH6.1 served to d0lxbld1/3
       d0lomite  215GB RH7.1 served to d0lxbld9
       d02ka     275GB RH7.1 builds served to clued0 etc

---

Online:

Online status report, 12-Nov-01


- Kerberized interactive Online systems (Linux, Tru64),
  restricted user access to Kerberos authentication

- Working on Computing Operation Readiness Clearance
    - have done network isolation tests, including cold
      start while completely isolated (may need 1 more
      test of calibration while isolated)
    - preparing disaster recovery plans
    - more to come on Access Control List restrictions
    - working on external access to Windows machines

- Shutdown activities:
    - improvements in 1553 interface software
    - alarm system enhancements

- Tested data copying to STKen

- Installing 8 new Linux Networx nodes for general use
  (EXAMINEs mainly, though a couple may be swiped for other
  uses)

- Still looking/hoping/dreaming for EXAMINE czar

- Run2B TDR
    - looking for $1M over 5 years, mostly for replacements
      of obsolete systems
    - likely will get only small fraction of this...
 

---

Remote Farms
 The farms are currently running smoothly. Requests are being processed
and the results are being stored in sam.

Software: mcp10

1.01 M reco events in sam from phase mcp10.

The current release of software on the farms is as follows:
  Generators:     p10.06.01
  Dogstar:        p10.06.01
  D0sim:          p10.06.01
  d0reco:         p10.08.01
  recoanalyze:    p10.08.01
  mc_runjob:      v03-03-13
  cardfiles:      v00-03-05
  MagField:       v00-01-00

We are moving to a request based system and have a web page showing
current requests and identity of farms carrying out the processing. This
is available at
http://www-d0.fnal.gov/computing/mcprod/Requests/Requests.html

Some remote site problems with sam stations are slowing down transfers.

The farms are running reco certification jobs for each release of reco.
p10.08.01 and p10.09.00 are the most recent examples. The results are
being stored in phase recocert.

The new metadata system is nearly fully implemented and the request
system will follow. Testing should begin soon.

---

Remote Analysis:

Jae Yu

I haven't had time to do anything beyond what I reported in the IB meeting on my slide, which will soon be posted on our web page.

I've got Gordon to work on the build error log and John Ellison to work on further generalizing his tools. I am in the process of setting up a "remote-analysis" web page, but it is not ready yet. Because multiple stations are expected to participate in the bi-weekly meeting, I have contacted Sheila to find out how many ports are available and what other technology we can use to enable active participation.

 

---

SAM Data Handling System

SAM

Statistics

428 registered SAM users in production
      283 of them have at some time run at least one SAM project
      267 of them have run a SAM project at some time in the past year
      181 of them have run a SAM project in the past 2 months
222 registered nodes
150,847 cached files on disk somewhere
      146,908 of them on d0mino
      1,299 on d0lxac1
      2,301 on a clued0 node
      337 on imperial college test machine in the UK
      503 on linux build machine
281,066 data files known to SAM
      43,534 raw files  (all stored on tape)
      78,463 reconstructed files  (76,305 of them actually stored)
      19,700 root-tuple files


Issues and status

Opening 010129: Offer to candidate was rejected by candidate. We are
interviewing additional candidates this week.

Gabrielle and his 2 students, Friedemann Lindemann and Frank
Strzelczyk, have put together a wonderful presentation for the display.
Vicky, Lee, Sinisa, and Gabrielle were at SC2001 last week in Denver.

Tape problems should be under control. We are now fully converted to STK
and LTO for on-line, MC and group data.

The CORBA naming server has caused problems in the past. We are testing a
new naming service with persistency that should resolve this. We plan to
deploy it this month, on Nov 26.

Some queries have caused the system to jam. We have split the user db
server away from the dbserver for the stations, and are looking into how
to deal with long (usually event picking) queries. This has been fixed by
Matt.

User support is sometimes slower than people like. We are training many
Dzero volunteers to help. Lauri is available at Dzero every Wednesday on
DAB5 (my office). She has not been overwhelmed by walk-ins.

Working through the work list available at
http://d0db-dev.fnal.gov/sam/doc/plans/sam_v3_2_1_tasklist.html . We are
behind on this: we had planned to complete it by Nov 8, but farm testing
and the MC project are still being done. We now hope to complete all of
this by Dec 1. Starting a new list for the next 6 weeks. We need to fill
the open position asap.




 

---

Databases


Calibration DB apps

All calibration database applications are building and testing the
complete chain that gets the data to the final reco application.
This chain includes online db > offline-staging > offline db > dbserver >
D0om-CORBA > reco-calibrator.
SMT has been tested with multiple test clients running in parallel on
farm nodes, and is now being tested with many dozen reco clients on the
farm.
Other calibration applications that are close include Muon and FPS. CFT
and CPS are not far behind. Calorimeter has a ways to go.
Major difficulty has been naming service. This is being addressed with a
new naming service to be deployed later this month.
Plan to use 2 linux boxes returning from SC2001 for db server boxes.
Dave & co will help set this up and test
over the next few weeks.

RCP review was held November 5 to discuss status and schedule for this
application. Issues discussed include:
Review of code
Urgency of need for the product
Status and schedule for completion
Long term maintenance

Trigger database

Entering the simulation 'p9' trigger list
Muon filters have been added at L2.
MarkAndPass added to all scripts at L3.
Discussion with L3 people regarding a proposal for tracking versions with
releases.
Still some problems getting XML output to work with the parser in L3.

Jim's review doc at
http://www-cdserver.fnal.gov/cd_public/cpd/aps/rcp_corba_review
Luminosity DB

Jeremy has completed the luminosity access packages needed for Nov 17,
minus L3:
Added transaction functionality
Added get methods for lbn
Updated all methods for new schema
Updated schema for new sync concept
Released a new lumReader.py that can load all current block files (330000+)
Added error handling for missing keys
L3 is turned off
Added new docs and schema design to devel website, along with new queries
http://www-d0ol.fnal.gov:8508/lm_db/

Remaining work for Luminosity Nov 17:
Fix L3 in schema, access api, and lumReader.py
Backload all relevant data
Help Gregor get v1 of lmDb out and working
Cut the database and software to Production online
Add to website docs and related summary queries

Run Configuration

Offline database is  being updated from online. Connected to sam through
run. Parameters added as sam dimensions can be queried using  sam tools,
and datasets can be created.

Adding storage to d00ra1

As of 10/30, there were 52M events in the SAM file catalog. This
requires 28 GB. In "normal" running we expect to add ~50M/month.
This is consistent with our early estimate of needing 1GB/day to the
database.
Also, there are storage needs for all other applications, the largest
being luminosity.
Based on this, we are  purchasing 500 GB disk to add to our production
RAID system (needed by March).
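
The consistency between the catalog numbers and the 1GB/day estimate can
be checked with a few lines of arithmetic. The Python sketch below uses
only the figures quoted above; the 30-day month is an assumption.

```python
EVENTS_IN_CATALOG = 52e6   # events in the SAM file catalog as of 10/30
CATALOG_SIZE_GB = 28.0     # storage used by those events
EVENTS_PER_MONTH = 50e6    # expected rate in "normal" running
DAYS_PER_MONTH = 30        # assumption

# Storage per event, scaled to the expected monthly and daily growth.
gb_per_event = CATALOG_SIZE_GB / EVENTS_IN_CATALOG
gb_per_month = gb_per_event * EVENTS_PER_MONTH
gb_per_day = gb_per_month / DAYS_PER_MONTH

if __name__ == "__main__":
    print(f"growth per month: {gb_per_month:.1f} GB")
    print(f"growth per day:   {gb_per_day:.2f} GB")
```

This gives roughly 27 GB per month, or about 0.9 GB per day, for the
event catalog alone, in line with the early 1GB/day estimate.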



---

Simulation

 

 
 Simulation Status (Nov. 2001)
=============================

D0gstar:
--------
Created a new libsimpp_info.a for standalone SimEventInfoChunk,
so that the L3 code only needs to link with simpp_info and does
not need to link with simpp_evt any more.
This modification is fully backward compatible.


D0sim and D0Raw2Sim:
--------------------
A lot of effort has been put into debugging D0Raw2Sim and D0Sim so
that the pileup can overlay zero-bias data on MC events. Lisa has
found the problem with the SMT crash: it is due to the hits on both
sides. Mike H. has found that the CFT crash was due to the head
version in cvs not having been released. However, all of this
debugging and all of these fixes used MC raw data. The real
zero-bias data taken before the shutdown have hardware problems
and could not be used to test D0Raw2Sim. We have requested
new zero-bias runs to be taken after the shutdown is over.


D0TrigSim (Dugan Oneil):
------------------------
Significant progress has been made in using the full MC
trigger list from the trigger database. 250k events were run through
d0trigsim from p10.09 with the full L1 and L2 lists and a subset of the L3
list which focussed on jets and electrons (for tool certification
studies). This required patching of the L1 framework, L2Global and L3
parser, among other fixes. L1Cal rates derived from these samples were
compared to pre-shutdown rates from real data and were found to be in good
agreement.