Computing and Software Status

February 2002

·Algorithms
 

RECO
L3

·FNAL FARMs

·Graphics

·Infrastructure

·Online

·Remote Analysis

·Remote Production

·SAM

·Databases

·Simulation

---

Algorithms

RECO

     Harry Melanson

RECO Status Report - Feb. 2002
Harry Melanson, Algorithms group


The current production version of RECO that is certified for processing real
data is p10.14.01. It was installed on the farms on Feb. 12, 2002 and is
currently being used to reconstruct incoming data and to reprocess the
"Winter, 2002" data set.  The previously used version was p10.11.00.

Official Monte Carlo production is currently using p10.11.00.  Later
versions contain a bug that prevents RECO from being used on farm nodes that
do not have an external network connection.  This bug will be fixed in
p10.15.00 (see below).

p10.14.xx contains several major steps forward for processing real data as
compared to p10.11.00 (see below).  Probably the most significant are
related to track reconstruction.  For the first time, we have a chance of
finding "believable" CFT tracks in post-shutdown data.  In addition, the
quality of SMT tracks has improved.  Having said that, it must be stressed
that there are still many issues to study / resolve.  As an example,
p10.15.00 will contain another fix to the CFT electronics readout map, which
results in 2 more CFT tracks per event (and more global tracks) (!).  Also,
a complete default set of SMT thresholds will be included, yielding 2 more
SMT tracks per event (!). (!!!)   Upcoming studies will include looking at
CFT stereo layers, SMT internal alignment, CFT internal alignment, SMT
calibration, CFT calibration, etc. etc. etc.  Users of tracks (everyone?)
must understand that we have only begun to explore our new tracking system.
All are encouraged to participate directly (and with dedicated effort) in
the detector groups, the tracking group, the vertexing group and/or the
Tracking Task Force.  There is still a lot of fundamental work to be done
before we have a chance for believable physics results.

Other important features of p10.14.00 as compared to p10.11.00 include
emreco H-matrices tuned using plate MC (and no preshower, as currently
required for real data), a fix for hardwired PDT resolutions resulting in
improved central muon reconstruction, and the enabling of the dynamic mode
in NADA resulting in significantly improved hot cell killing.  The previous
comment about "a lot of fundamental work" should be understood to apply to
all of the components of OUR reconstruction program.  All algorithms and
object id groups are still understaffed and desperately need collaboration
involvement.

We plan to build p10.15.00 during the week of Feb. 18.  The list of changes
is included below.  It will have yet more significant improvements for both
data and Monte Carlo production.

At this time, reco developers are (should be) working on making the p11
release ready for use by the collaboration.  There are several steps
involved.  All improvements that were developed for p10 must be included and
integrated with any new p11 features.  These must be tested, and the results
certified before p11 can be used.  One important new functionality in p11
will be access to the detector calibration databases for real data
processing.  This will require new coordination between detector groups,
shifters, database experts, software developers and farm groups.  Another
major new feature in p11 will be the first version of the thumbnail.  Since
this is the ultimate "physics analysis" interface to the data, it is very
important.

A note to users:  Although there are p11 releases currently being built,
users should understand that neither reco nor reco_analyze has been
certified for use.  An announcement will be made when p11 is generally
useful for "reco".

Because of the huge amount of effort expended to make p10 useful for real
data, we are significantly behind our p11 schedule.  I hope to have a
revised, realistic schedule within two weeks.  It is obvious that we must
converge as quickly as possible on p11, in order to get back on track with
our strategy for quarterly releases.  I plan on developing a p12 reco
schedule in the next two weeks as well.

As an aside, I would like to use this opportunity to explicitly thank ALL of
the members of the Tracking Task Force (TTF), who have worked extremely hard
to understand the data coming out of our new tracking detectors.  We have
had several groups that have expended heroic effort to give OUR experiment
the chance to succeed (ones that come to my mind are the people who actually
BUILT the silicon or fiber tracker detectors, or the crew trying to get all
of the AFE boards installed, or the DAQ "guys" giving us REAL rate - others
are probably obvious to you so please include them in "my" list). But let me
make two distinctions.  First, I believe the TTF is the first group that had
to cross over all "organizational boundaries" in the experiment.  Problems
related to hardware, electronics, DAQ, online, calibration, software, etc.
etc. etc. were not allowed to be "blamed" on someone else.  They simply had
to be solved (and in some significant cases, WERE SOLVED).  Secondly, the
members of the TTF had to watch as others of the collaboration worked on
"first physics results for Moriond", and listen to complaints such as "NO
ONE understands how to match tracks with the calorimeter" or some other
comment that reflected relatively little understanding of the "current"
understanding of the "detector".  I admire the devotion of these
individuals, and I'm extremely proud to be their collaborators.  When all is
said and done, we stand on the shoulders of these people.

Additional current reco related activities include:

1) The latest version of reco has a significant number of memory leaks.
Although they do not appear to significantly impact current production, an
effort is underway to understand and fix them.  Some of the leaks are
straightforward to solve, and will be fixed in p10.  Others will require
substantial rework of existing code, and will be fixed in p11 or p12.  The
FNAL CD C++ experts are being consulted on a case-by-case basis.

2) Linking the reconstruction executable requires a significant amount of
memory.  The CD C++ experts are being consulted to see if this can be
reduced.  Improvements may be generally applicable to all D0 executables.


More details about the following are available on the reco status page:

http://www-d0.fnal.gov/computing/algorithms/status/p10.html


Changes to be included in p10.15.00
-----------------------------------

* Yet another fix to the CFT / CPS electronics map (2 more CFT tracks per
event + more global tracks).

* Default SMT pedestals for all channels (2 more SMT tracks per event).

* Fix d0omCORBA allowing reco to process MC on a node with no external
network connection

* Fix DSPACK allowing reading files on RH 7.1 (i.e. clued0)

* Fix DSPACK to allow processing files with events from many runs

* Fix memory leak in cal_nada

* Fix bug in cal weights related to mass-less gaps (goes along with a
d0gstar patch)

* Fix CFT clustering code to automatically select thresholds for MC or data
(fix cures a problem with d0sim)



Major p10.14.xx upgrades since p10.11.00
----------------------------------------

* For the first time, we have a CFT electronics map that is "more or less"
correct.

* In addition, fiber by fiber CFT thresholds taken from online calibration
runs are being used (via flat files).  These reduce the occupancy to
reasonable values, reducing the number of fake CFT tracks.

* The SMT geometry uses the survey alignment constants, and has a correction
for the relative SMT N/S barrel alignment.

* The SMT Lorentz corrections are enabled, and use the run configuration
database to get the correct field polarity.

* Post-shutdown SMT pedestals are being used.

* The tracking group has supplied new paths which extrapolate SMT tracks
into the CFT, allowing for misses.

* The correct nominal beam position for post-shutdown data has been included
in track reconstruction.

* Remove hardwired PDT hit resolution, increasing number of central muons
with matched A and BC segments

* New H-matrices tuned with plate MC (no preshower).

L3

 

---

FNAL Farms

Farm Report for 27 Feb, 2002


Processing with p10.14.01 D0reco is complete
through the beginning of February.
A recotest run of p10.15.00 was successfully
completed on Feb 26th.   We will shift normal processing over
to p10.15.00 as soon as Harry gives us the OK.

New 1GHz farm nodes were turned over to us on
Feb 25th.    We are still running down setup and
configuration problems on these nodes.   We expect they
should be in production by Mar 1st.

We are now running with access to both STKen and the
new Mezzanine D0 robot.   So far this looks very good, with
no known problems attributable to these systems.

Evaluation of hardware for next farm purchase is
proceeding.  Vendors have been notified of requirements and
evaluation units should be arriving next week.  Benchmark
sets for the evaluation are being put together.  D0reco
will be run as part of the benchmark tests.

Graphics

 

---

Infrastructure

Infrastructure/Code Management Status: Feb 2002

We've more or less recovered from Christmas.

Since the last report, we have ceased doing routine "t" builds on RH6.2. We are
still doing RH6.2 production builds  since the farms still need it. We have also
ceased doing maxopt "t" builds due to lack of requests and lack of build CPU. So
we are now routinely building "t" releases on Irix, and Linux RH7.1 Debug only
(mostly), but Irix, RH6.2 and RH7.1, debug and Maxopt versions for production.

                frozen
  t02.02.00     Jan 29
  t02.03.00     Feb  4
  t02.04.00     Feb  9
  t02.05.00     Feb 15
  t02.06.00     Feb 25

We are making fairly good progress on production releases. The last couple of
p10 releases have been real "pass" releases in that they are bug and problem fix
releases only, no developement. The p10 releases were abandoned by the trigger
groups as being too old and not worth updating. So p10 is MonteCarlo +
D0Reco/reco_analyze only at this point. The trigger people have switched all
effort to p11. It is p11.01.01 that is running on Level 3 at the present time.
The Reco people are just switching their major effort to p11. We will have close
to 100 modified packages for p11.02.00, starting today. This is a *huge* number
for a production release, but indicates the amount of effort that was going into
p10 and getting the code working on real data. This sort of thing should not
repeat.

Production Pass Releases:
                frozen
  p10.14.00     Feb  4
  p10.14.01     Feb 12
  p10.15.00     Feb 26 (maybe)
 
  p11.01.00     Feb 13
  p11.01.01     Feb 25

OSF: last one was onl01.67.00

Build Resources:
  Build Machines
    The major resource problem we are having these days is lack of machines.
Builds on d0mino have taken as long as two days (44:20 hours) and almost 24
hours on the other two build machines. This is with at least one other build
occurring at the same time. This has gone up from about 8 hours on d0mino about a
year ago when we first installed the parallel build system. The growth in time
has been fairly slow but steady. It cannot be attributed to an increase in
packages or anything like that. But it does pretty much parallel the use of
d0mino. However, there should be plenty of resources available on d0mino. So
this is not well understood. The build times on the Linux boxes are pretty much
what we expect. D0lomite (RH7.1, 750MHz, 8 processor) is only a little faster
than d0lxbld4 (RH6.2, 500MHz, 4 processor) because it's using disks nfs mounted
from d02ka. This costs about a factor of two. However, we would like to keep
this arrangement because in this configuration the builds are instantly
available to the entire Linux world at FNAL, ClueD0, in particular.
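
As a rough check on the figures above, the short sketch below converts the quoted d0mino build times into a growth factor. The 44:20 and 8 hour numbers are taken from this report; the helper name is purely illustrative.

```python
def hours(hhmm):
    """Convert an 'HH:MM' duration string into fractional hours."""
    h, m = hhmm.split(":")
    return int(h) + int(m) / 60.0

# Figures quoted in this report for builds on d0mino.
worst_now = hours("44:20")   # longest recent build
year_ago = hours("8:00")     # approximate build time about a year ago

growth = worst_now / year_ago
print(f"d0mino build time grew by a factor of about {growth:.1f}")
```

This compares only the worst recent case with the figure from a year ago, so it is an upper estimate of the slowdown rather than a typical value.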

  Memory
    No problem there.

  Disks
    We recently filled up the d0cvs disk. We quickly found some files to delete
(core files that people had committed, etc.) and got back on the air. During today's
(2/26) shutdown, the repository was moved to a different, larger disk. We are
still (I am told) able to make the rotating nightly copies, with copies from the
last 4 nights available at all times. So we should be good for now.
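
The report does not say how the rotating copies are actually made. As a minimal sketch of one common scheme, assuming a simple round-robin over four slots, each night's copy overwrites the oldest slot so that the last 4 nights are always on hand. The function name and slot numbering are hypothetical.

```python
from datetime import date, timedelta

N_SLOTS = 4  # the report says copies from the last 4 nights are kept

def slot_for(day, n_slots=N_SLOTS):
    """Pick the backup slot to overwrite on a given night (round-robin)."""
    return day.toordinal() % n_slots

# Any 4 consecutive nights land in 4 different slots, so no copy from the
# last 4 nights is overwritten before it ages out.
start = date(2002, 2, 23)
nights = [start + timedelta(days=i) for i in range(N_SLOTS)]
slots = [slot_for(night) for night in nights]
print(slots)
```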

    On /d0dist/dist/ we have:
       d0mino    278GB + 36GB for tarfiles
       d0lxbld4  211GB RH6.1 served to d0lxbld1/3
       d0lomite  215GB locally if we need it, but currently uses the disk on d02ka
       d02ka     275GB RH7.1 builds served to clued0 etc

    Using the nfs disks on d0lomite slows the builds a lot (30% at least) but it
makes the builds available immediately. So far, this is judged to be more
important than speed.


---

Online:

Online Status Report

              2/26/02  S.Fuess

 

- Improvements in the reliability of the VBDi to VRC link have greatly increased overall DAQ rate into Level 3, now somewhere ~100 Hz

 

- L3 output has been restricted to ~25 Hz to match current  Offline capabilities, and additionally to avoid running  up against Online host machine limitations

 

- Setting up 2 Linux boxes with Gb adapters to act as home for Collector and Distributor processes, thereby giving more resources to Data Logger and hence providing 50 Hz capability.  But there are problems with Linux Gb drivers holding this up.

 

- SBCs are operating in several CFT crates as "virtual VRCs"

 

- Recent tests have VRCs operating as "virtual SBCs".  This is more the migration path where groups of crates/VRCs will be gradually replaced with SBCs.  This configuration requires the functioning of all the pieces of the new DAQ.

 

- Attempting to purchase more Linux systems for Control Room and monitoring, but finding appropriate systems is a slow process.

 

- Progress being made on defining EXAMINE common needs and setting goals for enhancements (trigger selections, begin/end run actions, database connections).

 

- Discussions on needs for more run/detector configuration information

                         

---

Remote Farms
 

---

Remote Analysis:

 

---

SAM Data Handling System

SamStatusReport20011212
SamStatusReport20020226

 

SAM-core

 

The v4 sam is quite stable after a few minor problems with the initial installation on d0mino. There have been several major technical achievements in the last month including 1) setting up remote stations with “parasitic” stagers running on d0mino, and 2) getting sam to operate in the distributed analysis mode. Both of these employ the distributed caching now built into the sam station. Using parasitic stagers running on d0mino, remote stations are capable of accessing files in enstore and moving them through the d0mino cache to the remote site. Chris has been working to debug the distributed station on clued0 and has a working system. There is still work needed to achieve the final desired behavior and further cooperation from the clued0 administrators is required.

 

There have been several problems also. Unfortunately, the much anticipated “fast station revival” was a failure and recovery of the station is slower than ever. This has to be resolved and will take more time from Matt, Igor and Andrew to figure out. Sinisa has spent considerable time working on a fix for a problem that causes common IDLs from the rcp, sam manager, and calibration packages to clash in the build. This needs consensus from Alan, Paul and Mark, and should be finished in the next week.

 

The ability to move data from enstore to remote sites enables us to expand the number of remote stations. Thanks to the efforts of Lauri and Chris to compile clear and concise instructions, and to offer hands-on assistance at the d0RACE workshop, we now have more than 10 new remote stations, in addition to the dozen or more in the already existing network.  This is showing up on the network monitors and we are watching for possible problems it may create on D0mino.  We do not plan to register additional stations until we digest this load. There will probably be a major upgrade to sam around the end of June that will incorporate many of the research efforts coming out of sam-grid. Sometime after that we will expand the number of stations again significantly. Many of the sites in the current station list are participating in the D0 sam testbed project. This is both an official grid outgrowth and a vehicle to involve remote sites and the FNAL networking group.  It needs more help to organize and steer the effort.  Testbed sites include the following:

 

- Fermilab                       Batavia, IL
- Imperial College               London, UK
- IN2P3                          Lyon
- Lancaster                      UK
- Munich                         DE
- NIKHEF                         Amsterdam, NL
- Prague                         CR
- Wuppertal                      DE
- Boston U.                      Boston, MA
- University of Arizona          AZ
- U. Texas, Arlington            Arlington, TX
- U. Oklahoma, Langston          Langston, OK
- Indiana U.                     Bloomington, IN
- Louisiana Tech                 Ruston, LA
- University of Kansas           Lawrence, KS
- Michigan State University      East Lansing, MI

Work is being organized to understand the networks, and to begin testing more extensive SAM station deployment and operation in a systematic fashion.

 

Our new D0 sam team member, Andrew Baranovski, is primarily learning the design and code of the sam station cache manager  and file storage server. There are a few minor fixes and feature additions he will begin to implement soon to get his feet wet. The major project is to merge the function of the station cache management and the file storage server together, and implement data routing in its true final form within the system. This will allow us to set up static routes for data transfers among stations and provide the control required to set up the  remote data center hierarchy we desire.  Also,  additional work is needed to resolve some problems that have caused quasi-deadlock conditions on the farm, and may be issues on clued0.

 

There are some improvements to the shift operation, including a new tracking tool and some help for scheduling shifts. Lauri has built a web tool that allows the shifter for each day/zone to complete a simple “button” checklist. This information is put into our oracle database and preserved. Later we can chart the history for each category monitored and observe which are the more problematic.  Don Coppage (KU) has agreed to help manage the shift list for a while and this will relieve me of some of this responsibility.

 

SAM-tools

 

The tools team is not yet holding regular meetings. Wyatt is collecting information, and so far has talked with Matt Vranicar and Michael Begel.  We need to resolve the personnel  gap, since Carmenita will be unavailable for a few weeks.  Wyatt expects to have a preliminary task list by the end of this  week.

 

 

SAM-grid

 

The Grid team is working to understand Globus tools, especially GridFTP, GSI, and MDS. Igor is studying EUG projects and in particular the WP1 job submission and resource broker. An adapter for sam to condor has been built and is being debugged. It is now used at Imperial College and Wuppertal. They are evaluating condor-G and GRAM as part of the job submission for the summer milestones.  Sinisa is doing work in monitoring and information services, starting with the evaluation of MDS, and he has formed some preliminary opinions. Gabriele is evaluating GridFTP, using certificates with GSI. He has been successful using DOE Science Grid certificates of authentication to perform file transfers between FNAL and UK sites (a feat which drew applause at the Global Grid Forum…sad but true!).  They have collected use cases for the job control language and job submission. John Weigand and Gabriele are providing SAM data file use stats for UC Students, and database usage numbers for Koen Holtman to use in the CMS Grid requirements doc. Iain and Dave Evans are now testing the MC request system to specify and track MC processing tasks at remote processing sites.

 

We are building and coordinating our D0GRID collaboration, which currently includes FNAL, Imperial College, Lancaster, Prague, NIKHEF and UTA. There is interest and contributions from many other US and European sites as well.  Phone and video cons are held at least bi-weekly.  Face-to-face meetings are imperative and we take every opportunity we can to hold them. Gabriele Garzoglio started in Nov. last year to work full time on ppdg.  Andrew will work with the SAM Core team to allow more ppdg-related time for Sinisa and Igor. New funding in the GridPP (UK) project will provide additional manpower for D0GRID at IC and Lancaster. CDF is evaluating using SAM for data handling and Sinisa has spent quite a bit of time making the infrastructure (sam_bootstrap, sam_config, sam_db_server, and others) configurable to work for anyone. Lauri has contributed her ups/upd and other experience to this.

 

 

 

Databases (Drawn largely from Ruth’s notes for the last meeting)

 

1. Database server infrastructure reworking - Steve, Jim, Herb

 

They have done a lot of design work and written a short paper on the changes they are hoping to make. This will impact Doom. Some recoding of dbcore is needed to separate out classes with specific functionality, and they will link through a connection management class. The performance and the impact on the applications still need to be understood.

 

2. Calibration databases

 

2.1 SMT

 

The SMT offline database tables are in production and have been filled by Taka. The client code is in Reco and ready for production in the p11 release.

 

2.2 Muon

 

This is very close to being ready and should be in the p11 reco release.

 

2.3 CFT/CPS issues

 

Eric Meyers is looking at the pros and cons of the CFT/CPS merge for the offline. Eric is writing an update to Jeremy's procedure which will move the CPS and should work for both. Zhong-Min and Eric have something that works for this.

 

2.4 Calorimeter status - Ursula

 

Ursula is making lots of progress and may get to integration in the offline this week. They will adapt the online/offline onl_cal_transfer. They are anticipating 30 Gbytes/year for offline and this is considered fine. Updates for the constants will occur a couple of times a week and the whole tree will be put into the database for each calibration set. It took 15 minutes to retrieve a complete set of constants, and if multiple runs are needed for a data set there may be performance issues.
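
To illustrate the scale of the performance concern, the sketch below estimates the total retrieval time for a data set spanning several runs, with and without reusing a calibration set that has already been fetched. Only the 15 minute figure comes from the notes above; the run numbers, the run-to-calibration-set mapping and the function name are made up for illustration, and nothing here reflects how the actual client code works.

```python
RETRIEVAL_MINUTES = 15  # per complete set of constants, as quoted above

def retrieval_minutes(run_to_calib_set, cache=True):
    """Estimate total time spent fetching constants for a group of runs.

    run_to_calib_set maps run number -> calibration set id (hypothetical).
    With cache=True a set that was already fetched is not fetched again.
    """
    if cache:
        n_fetches = len(set(run_to_calib_set.values()))
    else:
        n_fetches = len(run_to_calib_set)
    return n_fetches * RETRIEVAL_MINUTES

# Made-up example: six runs covered by two calibration sets.
runs = {1001: "A", 1002: "A", 1003: "A", 1004: "B", 1005: "B", 1006: "B"}
no_cache = retrieval_minutes(runs, cache=False)
with_cache = retrieval_minutes(runs, cache=True)
print(no_cache, with_cache)
```

With updates only a couple of times a week, many runs would share a calibration set, so reuse of this kind is where most of the saving would come from under these assumptions.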

 

3. Trigger database

 

Results of meeting with L2, review/requirements gathering - Elizabeth, others.  Connections between online and offline databases do not allow for l2-software-version. Major versions are made offline release in time for a new file; the Trigger DB will implement the Major Version number, which is associated with a set of tools, filters, etc. Dynamic changes from run to run are made in the online environment, and COOR would support them through a resource file. If they want both a major and a minor version downloaded then Scott would handle this through a Resource File. Ruth will arrange a follow-up meeting with L2. Ron Lipton is expected to gather the group to coordinate discussion of requirements, policy and procedure.

 

4. Luminosity database status - Jeremy

 

Jeremy finished coding of transfer from luminosity files to database. Space reports have been given  to Diana and Anil but these are wrong - they are much too large. Ready to go production in offline and they need to allocate 10-30 GB.   Offline needs documenting. Next step is to build an offline dbserver and application interface for the information. Jeremy will do a prototype. Need more help and they will ask the Luminosity group for a person for this. It is important to start on Runs Summary database updates for Data Quality. Propose to leave the offline application interface for luminosity and start on this pending a d0 person to do the client side of the luminosity offline application.

 



 


 

---

Databases

D0 Databases Feb Monthly Report
for comment and question...


Production Applications for Reco Requirements - the runs summary database is being
used by Reco in production. This means database downtimes affect all reco
developers unless they configure the RCP parameters to not use the database.

Calibration Database Applications - Progress is being made on the redesign
of the Calorimeter tables and changes to the applications. Ursula has a
timetable for completing a first pass by the end of March. The merging of
CPS/CFT tables in the online is leading to confusion in the data transfer
and offline database application. This is being sorted out by the
application developers, but at present I do not see a timetable for
production release of CPS/CFT. SMT is in production and has been delivered
to the production release manager.

Luminosity Database - much progress has been made. Jeremy Simmons has
contributed significantly to the success of this project. The size of the
data stored has been much reduced. The application is in production in the
Online database and will be in production for Offline and Data transfer
within a few weeks.

Trigger database. A review of this application was requested by the CPB.
Progress towards this has included a plenary session at the collaboration
meeting, and 2 meetings with L2 and Ron where a short-term use and longer-term
plans were agreed on. Roger is writing the short-term usage which
requires only a small change to the output of the trigger database. Activity
is starting to define a) requirements from the trigger groups b) address the
overhead to changing the trigger list version -  there is a lot of
validation and generation required on any change c) understand the scope and
charge of the review. In the meantime work is continuing and extra help
would still be useful. Ron Lipton, the new trigger commissioning tzar will
be helping with the organization of the requirements/review.

Runs Summary database - this is in use by Reco and analysis programs. Work
has started to understand the specifics of the upgrade that is planned once
the luminosity application is in production in online and offline.

Database server infrastructure - the plan for the remaining work for the
database server infrastructure has been updated and work is proceeding. The
fix for multi-threading is in production for all servers.  The current work
will include the design for the proxy server and caching support needed for
remote analysis.

Online databases - Vladimir is working with the online commissioning group
on requirements and design for the support of online quality and the capture of
EPICS parameters.

Database Administration and infrastructure - the databases were down several
times for the installation of a new disk set. We are doing a root cause
analysis of why the process was so difficult. Otherwise the databases were
stable. Work is ongoing to understand the best scenarios for backing up and
restoring the offline production database as it gets larger.  Database
servers have been established on a new Linux node which will allow them to
be moved from D0mino in the near future. A failover plan is being written.
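
The failover plan itself is not described in this report. As a minimal sketch of the general idea only, assuming a client holds an ordered list of servers and simply tries each in turn, the selection logic could look like the following. The host names and the availability check are hypothetical placeholders.

```python
def first_available(servers, is_up):
    """Return the first server for which is_up(server) is True, else None."""
    for server in servers:
        if is_up(server):
            return server
    return None

# Hypothetical host names; the real server layout is not given in this report.
servers = ["db-primary", "db-backup"]
down = {"db-primary"}  # simulate a scheduled downtime of the primary node

chosen = first_available(servers, lambda s: s not in down)
print(chosen)
```

In a real deployment the availability check would be an actual connection attempt with a timeout, and the order of the list would encode which node is preferred.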

Comments from Harry Melanson:

 
Hi Ruth,

Some comments from a "consumer".

Reco (p10.14.xx) has been running on the "official real data reconstruction
farms" (i.e. "the FNAL farms") since early February.  This version has been
accessing the run configuration database successfully, with NO (!) reported
problems.  Being able to access that db allows the reconstruction to
correctly apply the SMT Lorentz corrections, which significantly improves
the position resolution of its 3-dimensional hits.  This has been a
long-standing difficulty with previous reconstruction versions, and it is
significant that we don't have to worry about this again.  It is also the
first deployment of a database in "reconstruction production mode", and I
personally see this as a big success.  THANKS to all involved.

The only problem we've experienced since p10.14.xx was deployed happened
"today" (corresponding to a "scheduled" d0mino downtime).  Since reco now
requires the run config db server to be up, when d0mino is unavailable, new
farm jobs can't start.  In addition, no user (including me) can start up a
reco job.  As you identified, moving servers to dedicated (with backup)
servers is the solution.  From reco's perspective, this is important.

The current p11 status for reco has us being able to reproduce p10 results
within three weeks (with p11.03.00).  This means using "flat files" for
dealing with calibration constants for SMT and CFT.  My plan is to then
start to initiate real db connections for detector calibration constants
(with p11.04.00 and beyond).  You indicate that SMT is now ready, but that
there are issues with CFT.  Feedback to you is that if CFT can become ready
within 3-4 weeks, that will be compatible with reco in p11.  This is very
important, if we can pull it off.  Although I know that CPS and CFT are
coupled, it is much more important to get CFT "online" (well, "offline")
than CPS.

The calorimeter calibration application is also important.  If it can be
integrated into the reconstruction program within the next two months, in
the development build, then I think we are "on pace".  This means that
"first pass by the end of March" must happen.

Regards,

  

---

Simulation