Computing and Software Status

December 2001

·Algorithms
 

RECO
L3

·FNAL FARMs

·Graphics

·Infrastructure

·Online

·Remote Analysis

·Remote Production

·SAM

·Databases

·Simulation

---

Algorithms

RECO

     Harry Melanson

No report filed

L3

November 2001 Level3 Monthly Report
===================================

Manpower Additions........................................................
Martin Wegner (Aachen) takes over the development of the L3 muon tool.
Viktor Koreshev (IHEP, Russia) has authored an L3TScint package for L3muo
with which a narrow scintillator time window can further constrain track
parameters. Diptansu Das (KSU) begins work on both l3fCalMEt and
l3fJetMEt. Han Do returns home to Vietnam, but would like to continue
development of the bitwise comparator.  Moacyr Souza ends his long
association with L3 (principally as architect and author on ScriptRunner,
numerous filters, and the L3 monitoring code) at the end of this month.
L3 has been extraordinarily fortunate in having had Moacyr's remarkable
talent and extensive software experience (he is likely irreplaceable and
will be sorely missed!).

Tool Reports
------------
L3Ele.........................................................Ia Iashvili
Implemented by filtering on highly electromagnetic jets. For studies during
the final runs before shutdown (132947-133014), 375k events were collected
with the following L1 trigger/L3 filters:
EM_LOW:  CEM(1,5)  Pt>7 GeV  emfrac>0.9  Rej=15  (emfrac alone=2.7)
EM_HIGH: CEM(1,10) Pt>11 GeV emfrac>0.9  Rej=5.5 (emfrac alone=3.6)

Applying (offline) the "Good EM" selection criteria (EM frac>0.9,
Isolation<0.2, HM41<200, |eta|<0.8) shows excellent efficiency.
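
As an illustration only, the offline "Good EM" criteria and the quoted
rejection factors can be written as a short Python sketch. The EMCandidate
container and its field names are hypothetical (not taken from the L3 code),
and the rejection definition (events in divided by events passed) is the
usual convention, assumed here rather than stated in the report.

```python
from dataclasses import dataclass


@dataclass
class EMCandidate:
    """Hypothetical container; field names are illustrative only."""
    pt: float         # transverse momentum in GeV
    emfrac: float     # electromagnetic energy fraction
    isolation: float  # isolation variable
    hm41: float       # HM41 shower-shape variable quoted in the text
    eta: float        # pseudorapidity


def passes_l3_filter(cand, pt_min, emfrac_min=0.9):
    """Online-style filter: Pt threshold plus an EM-fraction cut."""
    return cand.pt > pt_min and cand.emfrac > emfrac_min


def is_good_em(cand):
    """Offline "Good EM" selection with the cut values quoted above."""
    return (cand.emfrac > 0.9
            and cand.isolation < 0.2
            and cand.hm41 < 200
            and abs(cand.eta) < 0.8)


def rejection(n_input, n_passed):
    """Rejection factor, assumed to mean events in / events passed."""
    if n_passed <= 0:
        raise ValueError("no events passed; rejection is undefined")
    return n_input / n_passed
```

With this convention, pt_min=7.0 corresponds to the EM_LOW filter and
pt_min=11.0 to EM_HIGH.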

The plan is to run a fully functional L3Tele tool following the shutdown,
initially applying only a cut on emfrac (which should duplicate the
conditions run under at shutdown) and introducing an isolation cut as
additional rejection proves necessary.  Offline, harder shower shape cuts
will be studied with Mark&Passed data.

CFT Tracking................................................Ray Beuselink
Suitable MC is needed to complete efficiency studies in the absence of a simple
association between findable tracks and CFT clusters.  Running on recent
(real) data files, no CFT track candidates were found at all.  The current
algorithm fails if the innermost CFT layer is missing (this is merely an
implementation problem, since 5 axial layers should still be sufficient).
This fix joins some final coding being completed now.  New student
manpower has been identified, and certification efforts should begin
shortly.  It may be late January before they are complete.

Global/SMT Tracking.......................................Daniel Whiteson
Muon id discussions made it clear their current filtering strategy could
be enhanced with the possibility of a central track requirement at Level
3. In response Daniel studied the feasibility of a single track (smt-only)
filter, using 13000 events from runs 132167 and 132168 (magnet on).  A DCA
selection provides a highly pure track sample.  Rejection improvements
were made by employing a simple vertexer (using SMT-only tracks to filter
data).  Each event is searched for 2 or more coincident (in Z) tracks at
the DCA. Net efficiencies are small (~20% or less) but come with huge QCD
rejection.
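
The Z-coincidence search described above might look roughly like the
following Python sketch. The function name, the 1 cm window, and the use of
a plain list of z positions are assumptions for illustration; the actual
vertexer and its cut values are not given in this report.

```python
def has_z_coincidence(z_at_dca, window_cm=1.0, min_tracks=2):
    """Return True if at least `min_tracks` tracks have z (at the DCA)
    lying within `window_cm` of each other.

    z_at_dca  : iterable of track z positions in cm (hypothetical input)
    window_cm : coincidence window in cm (assumed value, not from the report)
    """
    zs = sorted(z_at_dca)
    # After sorting, any group of min_tracks tracks closest in z is
    # contiguous, so it suffices to check each run of min_tracks neighbours.
    for i in range(len(zs) - min_tracks + 1):
        if zs[i + min_tracks - 1] - zs[i] <= window_cm:
            return True
    return False
```

An event with tracks at z = 0.1, 0.4 and 25.0 cm would pass with the default
window, while one with tracks at 0.0, 10.0 and -20.0 cm would not.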

Muon............................................Paul Balm, Martin Wegner
L3TMuoUnpack has run successfully online, but is not yet independent of
the rapidly changing configuration files.  The fix provided by Scott
Snyder still must be implemented and checked.  The interim plan of running
a filter on local muon track reconstruction only must await the above fix.
A skeletal version of L3TMuon exists which calls L3TMC and
L3TmuoCentralMatch (the latter calling a central track tool).  The latter
has been extensively studied with a 1000-event 5 GeV Pt single muon sample,
but not yet on real data. The performance studies (and certification) of
these features have begun.

Timing results (J/psi sample, p10.06.00 on a 1 GHz clued0 node): unpacking
of SMT+CFT+Muon = 11+7+3 ms = 21 ms/event on average. GlobalTracker takes
7 ms/evt, CFTTracker takes >133 ms/evt. Muon tracking takes 66 ms/evt
(track matching ~4 ms/event). So while local muons only come in under
~70msec, a Global Track match requires ~95 msec.
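
The timing totals can be cross-checked with a few lines of Python. Which
terms enter each total is an assumption here (the report does not spell it
out): the local-muon figure is taken as muon unpacking plus muon tracking,
and the global-match figure as all unpacking plus GlobalTracker, muon
tracking and track matching. Under that accounting the local figure is
69 ms, consistent with "under ~70msec", and the global-match sum is 98 ms,
close to but not exactly the quoted ~95 msec, so the report's own accounting
may differ slightly.

```python
def timing_budget(unpack_ms, global_tracker_ms, muon_tracking_ms, match_ms):
    """Sum per-event times (ms) under the accounting assumed in the lead-in."""
    unpack_total = sum(unpack_ms.values())
    # Local muons: only muon unpacking and muon tracking are counted.
    local_muon = unpack_ms["Muon"] + muon_tracking_ms
    # Global match: all unpacking, global tracking, muon tracking, matching.
    global_match = (unpack_total + global_tracker_ms
                    + muon_tracking_ms + match_ms)
    return {
        "unpack_total": unpack_total,
        "local_muon": local_muon,
        "global_match": global_match,
    }


# Numbers as quoted in the report (ms/event).
budget = timing_budget(
    unpack_ms={"SMT": 11, "CFT": 7, "Muon": 3},
    global_tracker_ms=7,
    muon_tracking_ms=66,
    match_ms=4,
)
```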

The muon ID group has identified a number of data sets for L3Muon
certification including raw data taken with L1CalMuonTrig (235k events), a
single muon MC sample (5000 events flat between 5 and 100 GeV), and
"physics samples" with muons (WZ and B physics group's raw data files).
They have also compiled a working list of filtering parameters to be
implemented in the full L3Tmuon tool (currently the "local muon filter"
only cuts on "Minimum pT" and "number of track segments").

A proposal has been made for "Tight", "Loose", "Astub" quality cuts.
The offline efficiency for these is: ~60% (loose), ~30% (tight).

L3Propagator..............................................Arnaud Duperrin
Cross-check studies have compared this fast (~0.1msec) propagator to
offline and GEANT propagation (using the non-uniform magnetic field).
Residuals are within offline errors (example: CFT->Muon extrapolation:
sigmas~3-4cm).

L3CPS...................................................
The CPS tool simply awaits data. Once readout has been commissioned it
should take ~1-2 weeks to implement the CPS tool into an electron/photon
tool. A study of J/psi's shows that after L1/L2, one has 86-88%
efficiency for these low pt electrons. For the QCD20 sample, passing
L1/L2, a factor of 17 rejection is observed (for a track/CPS/cal match).
Regional clustering takes ~2 ms and unpacking 5 ms.

L3MEt..................................................Lee Sawyer
Online running of L3MEt still pending on certification studies.
Recent d0sim output suggests weighting problems (accounting for a shift
between tower and cell energies).  Still assumes the nominal (0,0)
interaction point.

L3Tau.......................................................Yann Coadou
The current strategy calls for a tau trigger in the trigger list by the
end of December. It would not be fully optimized but would Mark&Pass,
hanging in parallel with existing filters (thus running innocuously).
This would permit tuning studies under run conditions (rather than waiting
on p11). A more complete study (including tracks) would then be available
early next year I guess.  Certification (particularly running on real
data as part of a real triggerlist) must first be run.
 
 
 
 

---

FNAL Farms

         Mike Diesburg

Farm Report, Dec 12th 2001

Currently running release p10.11.00 on the FNAL farms.
We are attempting to run any all_stream data on farms rather
than specific shiftsets.  This has led to some problems
with some special run conditions which crash reco.  We don't
yet have a good handle on the impact of this on production.

The latest versions of SAM software were installed the
week of Nov 19th.   This included a new name service which
should improve farm efficiency.   We think this is now stable,
but it required over two weeks of concentrated effort on the
part of farm and SAM people to stabilize it.

A separate DB server dedicated to farm operations was
installed on Dec 10th in an effort to isolate farms from
DB overload and network problems.

A new interface to the farm production system
has been implemented.   This provides a web interface
for request approval, submission, and status.   It can be
found at:

http://d0db.fnal.gov/sam_farm_request

The go-ahead was sent to Eternal Graphics on Dec 5th to
proceed with filling the order for 32 new nodes.  Test nodes
are still running and have shown no problems to date.   We do
not yet have a firm delivery date for the new nodes.  It is
unlikely they will be installed before Jan 1st at this point.

---

Graphics

 

---

Infrastructure

During the last month we have continued to make regular "test" releases, one per
week with both debug and maxopt versions of each on Linux and IRIX. We have also
been trying to do at least a debug version for Linux RH 7.1.

                frozen
  t01.69.00     Nov 19
  t01.70.00     Nov 26  inc RH7.1 debug
  t01.71.00     Dec  3  inc RH7.1 debug
  t01.72.00     Dec 11  (will inc RH7.1 debug)

Production Pass Releases:
                frozen
  p10.10.00     Nov 16    
  p10.11.00     Nov 26
  p10.12.00     Dec 10

NT:  none

OSF: onl01.67.00

Linux, RH7.1 (for ClueD0 mostly)
    For most of the month we were limited to doing RH7.1 builds on the smaller
d0lxbld* machines building over NFS to the disks on d02ka. This is/was extremely
slow. It takes nearly a day and a half to do a full build from scratch. However,
on Friday, Dec 7, the newest Linux kernel 2.4.16 with SGI's XFS (file system)
patches was installed on d0lomite. Over the weekend Paul ran a full build on it
with more parallelism than we'd normally use and it *completed*! It looks like
they've solved the resource allocation problem where they weren't able to
recover memory from the disk cache. We have just begun building routinely on
d0lomite. This week and next (when we do another production release) will tell
the tale.

We are now able to distribute RH7.1 versions (debug only) remotely, but there is
a hand patch that needs to go in for anyone who needs to link any online
packages or rebuild the entire thing.

Build Resources:
  Build Machines
    Same as last month: Domino is becoming a real problem for us. It is
extremely slow. This is a problem for everyone, not just us.
    RH6.1 d0lxbld4 is fine. Builds, if we don't do too many at once, are done in
<12 hours. It's only when we try to do 2 "t" builds + 2 "p" builds that we get
into trouble. We rarely have the resources to do special builds without
impacting the normal build.
    RH7.1 d0lomite has lots of resources. We hope we can now use them.

  Memory
    No change since last month. Looks fine.

  Disks
    No change since last month.

    We now have:
       d0mino    278GB + 36GB for tarfiles
       d0lxbld4  211GB RH6.1 served to d0lxbld1/3
       d0lomite  215GB RH7.1 served to d0lxbld9
       d02ka     275GB RH7.1 builds served to clued0 etc

---

Online:

                          Online Status and Issues

                                    12-Dec-2001

 

- EXAMINEs (next week's topic)

 

- Learning 7.1 environment

    - Some apps have been built

    - Now mixture of 6.1.1 and 7.1 nodes

      - but share common release disks, so

        hard to have both binary versions

      - needs some effort to organize

 

- Remain behind in releases (at t01.67.00)

    - EXAMINE people would like more

    - no one to keep on top of this

      - main need is to manage resources

 

- Resources

    - More applications crawling out from under the rocks...

      - mostly EXAMINE style

      - don't have a home for them...

      - not easy to predict loads

    - Sloppy users

      - Disks always full

      - Orphaned jobs left running

        - aggravated by security restrictions

    - Don't know budget allocation

      - want to purchase ~20 nodes

      - but don't have room on FCH2...

 

- Logging rates

    - Have yet to see anything near design input rates, so

      have not stressed system

    - Plan to rearrange DAQ processes using new 7.1 nodes

      (will free up d0olc, main DAQ node)

 

- Network

    - L3 changes have mixed up VLAN schemes

    - Routing issues to STK problematic

    - NFS of pnfs problematic

 

Observations:

- DAQ

  - completeness: crates are often left out of run because

    of readout errors

  - efficiency low

    - above crate readout issues

    - L3 VRC problems

 

- We haven't reached the point where everything working is the

  "norm", hence don't see pressure to fix remaining aggravations

 

- Similarly haven't reached point where alarm system can be

  usefully employed - too much hardware is problematic 

---

Remote Farms
 
The farms are currently running smoothly. Requests are being processed and the results being stored in sam.

Software: mcp10

 4.3M reco events in sam from phase mcp10 and reco certification samples.
(See http://www-d0.fnal.gov/computing/mcprod/Dec_Stats.htm for details)

The current release of software on the farms is as follows:
  Generators:     p10.06.01
  Dogstar:          p10.06.01
  D0sim:            p10.06.01
  d0reco:           p10.08.01
  recoanalyze:    p10.08.01
  mc_runjob:     v03-03-13
  cardfiles:         v00-03-09
  MagField:       v00-01-00

We are moving to a request based system and have a web page showing
current requests and identity of farms carrying out the processing. This
is available at
http://www-d0.fnal.gov/computing/mcprod/Requests/Requests.html

mc_runjob has now been adapted to allow choice of rcp files in
reco and reco_analyze (for Jet Energy Scale) . These RCP files have to
be in the release. These will be stored in the mc_runjob documentation and the
switchboard files.

Some continuing SAM problems.

The farms are running reco certification jobs for each release of reco.
p10.08.01 and p10.09.00, p10.10.00, p10.11.00 have been completed.
The results are being stored in phase recocert.

The new metadata system is being tested and will be released soon.

 


system will follow. Testing should begin soon.

---

Remote Analysis:

 

---

SAM Data Handling System

SamStatusReport20011212


Major success
1. Distributed station
2. New naming service
3. Organizing d0grid
4. All data going to 9940 and LTO, M2 backup

We have made significant progress this month in getting the distributed
SAM station to function properly. It is now being used on the farm for
production and we plan to install it on clued0 this week. Chris has done
a wonderful job testing, working closely with Heidi, and understanding
all the problems, and Igor has spent a significant amount of his time
debugging and releasing. Late last month we deployed the Orbacus 3.4.x
name server. This was a major improvement to the reliability of the
system, partly because many problems were fixed over the 3.0.0 version
we were using, but also this version includes a logging mechanism that
stores in a file the clients that are registered. If there is a problem
and the naming service is interrupted, it recovers in its most recent
state and no re-registration is needed. Sinisa has spent significant
time getting this all working and the transition coordinated.

Iain and Vicky have written the needed documents to request European
funding for GridPP. This document lays out a plan for SAM grid work for
the next two years. We had our first regular D0Grid meeting this Monday
and discussed everyone's role, and our collective goals. Igor holds
FNAL Grid-team meetings Tuesday @10:30.

We have transitioned all data to the new tapes and so far all has been
very smooth. So far we have written approximately 40 tapes each. The
user groups set up so far to write to LTO tapes include: higgs, top,
bphysics, np, tau, and emid. I think this works well, and have received
confirmation that many of these groups have stored data into sam.
However, I do not see much data actually there so there may be problems
or complications they are not telling me about.  Several people have
tried backing up project areas to M2.

Close:
1. MC request
2. Shift coordination

We continue to move toward getting the MC request system in place.
Carmenita has finished most of the coding and has been testing with the
new description files Dave Evans and Iain have provided. We must get
this done before Christmas and it is already late. There is not really
that much work remaining, but it all needs to be exercised and iterated.

Lauri has provided a web based FAQ tool for the shifters, but so far
they have not added any FAQs. This shift herding takes more time than I
have, and my goal is to get the new co-leader involved with this. Jenny
Chen (summer student)  will be here over Christmas and she will help me
with some shift tools.


Problems:

1. SAM problems
2. Tape problems
3. Urgent items not done

We have observed several problems over the last week that have
compromised the functionality of the system. We have noticed the name
server "stalling" for some unknown reason. Also, we have seen that the
archive log files in the database are extremely active. The activity in
the archive log files has been identified as SMT calibration work that
John Weigand has been doing. John has stopped for now to see how this
contributes to our problems, and we will figure out how to resolve
this. The only other thing we know of that has been added in the last
week is the farm processing. To determine if this is causing problems to
the overall system we have installed a dedicated db server for the farm
and we will watch this closely. Meanwhile, Sinisa is debugging the
naming server to try and identify what the culprit might be. We
observed no problems in extensive testing in development, so we are
trying to think of what is different about production. It could be
something about ora1, or some network issue but this is not clear.

We observed a minor glitch over the weekend when several 9940 tapes went
noaccess. This was caused by some tape loading activity on Friday and
the system being left in an unusual state that made enstore think the
existing tapes being requested were not in the correct slots. In the
future, this should be dealt with by having the operators page the
enstore primary. I screwed this up this weekend by (somehow) thinking a
HD ticket submitted as "urgent" would actually be looked at.

There are many items still on the task list for this year, but the two
that come to mind are 1. marking file status, and 2. tracking framework
mode. I know these are important and we have discussed them within the
group. Other more urgent things, many of them operational, have
superseded these so far.



 

---

Databases


 

---

Simulation

Simulation Status (Dec., 2001)
 ==============================

 D0gstar:
 --------
 Updated muon .tz file (provided by Tom Diehl) to agree
 with the real muon geometry.

 D0Raw2Sim:
 ----------
 All required packages are in. The package making a L1CalTTChunk is
 not yet working properly. The person providing the necessary code is busy with
 higher priority items. Everything else seems ok but it still needs to
 be run with complete post-shutdown data. A zero-bias run has been requested
 and approved, but is waiting for the full CFT+SMT detector.

 D0Sim:
 ------
 The program is basically complete but its ability to merge
 real data with MC data has not been fully tested yet as proper
 D0Raw2Sim output files do not exist.

 TrigSim (by Oneil):
 -------------------
 In the latest production release (p10.12.00) we can run the full MC
 trigger list at all three trigger levels. The coorsim output files for L1
 and L2 still need some hand editing due to minor bugs in the database and
 changes to online L1Muon and FT And/Or term naming conventions. However,
 the hand-edits are simple enough that they are not a major hindrance for
 a skilled user. Hand-edits of the standard list have been produced and are
 available from the trigsim webpage.