Computing and Software Status
February 2002
·Algorithms
RECO
L3
·FNAL Farms
·Graphics
·Infrastructure
·Online
Remote Analysis
·Remote Production
·SAM
·Databases
·Simulation

Algorithms
RECO
Harry Melanson
RECO Status Report - Feb. 2002
Harry Melanson, Algorithms group
The current production version of RECO that is certified for processing real
data is p10.14.01. It was installed on the farms on Feb. 12, 2002 and is
currently being used to reconstruct incoming data and to reprocess the
"Winter, 2002" data set. The previously used version was
p10.11.00.
Official Monte Carlo production is currently using p10.11.00. Later
versions contain a bug that prevents RECO from being used on farm nodes that
do not have an external network connection. This bug will be fixed in
p10.15.00 (see below).
p10.14.xx contains several major steps forward for processing real data as
compared to p10.11.00 (see below). Probably the most significant are
related to track reconstruction. For the first time, we have a chance of
finding "believable" CFT tracks in post-shutdown data. In addition, the
quality of SMT tracks has improved. Having said that, it must be stressed
that there are still many issues to study / resolve. As an example,
p10.15.00 will contain another fix to the CFT electronics readout map, which
results in 2 more CFT tracks per event (and more global tracks) (!). Also,
a complete default set of SMT thresholds will be included, yielding 2 more
SMT tracks per event (!). (!!!) Upcoming studies will include looking at
CFT stereo layers, SMT internal alignment, CFT internal alignment, SMT
calibration, CFT calibration, etc. etc. etc. Users of tracks (everyone?)
must understand that we have only begun to explore our new tracking system.
All are encouraged to participate directly (and with dedicated effort) in
the detector groups, the tracking group, the vertexing group and/or the
Tracking Task Force. There is still a lot of fundamental work to be done
before we have a chance for believable physics results.
Other important features of p10.14.00 as compared to p10.11.00 include
emreco H-matrices tuned using plate MC (and no preshower, as currently
required for real data), a fix for hardwired PDT resolutions resulting in
improved central muon reconstruction, and the enabling of the dynamic mode
in NADA resulting in significantly improved hot cell killing. The previous
comment about "a lot of fundamental work" should be understood to apply to
all of the components of OUR reconstruction program. All algorithms and
object id groups are still understaffed and desperately need collaboration
involvement.
We plan to build p10.15.00 during the week of Feb. 18. The list of changes
is included below. It will have yet more significant improvements for both
data and Monte Carlo production.
At this time, reco developers are (should be) working on making the p11
release ready for use by the collaboration. There are several steps
involved. All improvements that were developed for p10 must be included and
integrated with any new p11 features. These must be tested, and the results
certified before p11 can be used. One important new functionality in p11
will be access to the detector calibration databases for real data
processing. This will require new coordination between detector groups,
shifters, database experts, software developers and farm groups. Another
major new feature in p11 will be the first version of the thumbnail. Since
this is the ultimate "physics analysis" interface to the data, it is very
important.
A note to users: Although there are p11 releases currently being built,
users should understand that neither reco nor reco_analyze has been
certified for use. An announcement will be made when p11 is generally
useful for "reco".
Because of the huge amount of effort expended to make p10 useful for real
data, we are significantly behind our p11 schedule. I hope to have a
revised, realistic schedule within two weeks. It is obvious that we must
converge as quickly as possible on p11, in order to get back on track with
our strategy for quarterly releases. I plan on developing a p12 reco
schedule in the next two weeks as well.
As an aside, I would like to use this opportunity to explicitly thank ALL of
the members of the Tracking Task Force (TTF), who have worked extremely hard
to understand the data coming out of our new tracking detectors. We have
had several groups that have expended heroic effort to give OUR experiment
the chance to succeed (ones that come to my mind are the people who actually
BUILT the silicon or fiber tracker detectors, or the crew trying to get all
of the AFE boards installed, or the DAQ "guys" giving us REAL rate - others
are probably obvious to you so please include them in "my" list). But let me
make two distinctions. First, I believe the TTF is the first group that had
to cross over all "organizational boundaries" in the experiment. Problems
related to hardware, electronics, DAQ, online, calibration, software, etc.
etc. etc. were not allowed to be "blamed" on someone else. They simply had
to be solved (and in some significant cases, WERE SOLVED). Secondly, the
members of the TTF had to watch as others of the collaboration worked on
"first physics results for Moriond", and listen to complaints such as "NO
ONE understands how to match tracks with the calorimeter" or some other
comment that reflected relatively little understanding of the "current"
understanding of the "detector". I admire the devotion of these
individuals, and I'm extremely proud to be their collaborator. When all is
said and done, we stand on the shoulders of these people.
Additional current reco related activities include:
1) The latest version of reco has a significant number of memory leaks.
Although they do not appear to significantly impact current production, an
effort is underway to understand and fix them. Some of the leaks are
straightforward to solve, and will be fixed in p10. Others will require
substantial rework of existing code, and will be fixed in p11 or p12. The
FNAL CD C++ experts are being consulted on a case-by-case basis.
2) Linking the reconstruction executable requires a significant amount of
memory. The CD C++ experts are being consulted to see if this can be
reduced. Improvements may be generally applicable to all D0 executables.
More details about the following are available on the reco status page,
http://www-d0.fnal.gov/computing/algorithms/status/p10.html
Changes to be included in p10.15.00
-----------------------------------
* Yet another fix to the CFT / CPS electronics map (2 more CFT tracks per
event + more global tracks).
* Default SMT pedestals for all channels (2 more SMT tracks per event).
* Fix d0omCORBA allowing reco to process MC on a node with no external
network connection
* Fix DSPACK allowing reading files on RH 7.1 (i.e. clued0)
* Fix DSPACK to allow processing files with events from many runs
* Fix memory leak in cal_nada
* Fix bug in cal weights related to mass-less gaps (goes along with a
d0gstar patch)
* Fix CFT clustering code to automatically select thresholds for MC or data
(fix cures a problem with d0sim)
Major p10.14.xx upgrades since p10.11.00
----------------------------------------
* For the first time, we have a CFT electronics map that is "more or less"
correct.
* In addition, fiber by fiber CFT thresholds taken from online calibration
runs are being used (via flat files). These reduce the occupancy to
reasonable values, reducing the number of fake CFT tracks.
* The SMT geometry uses the survey alignment constants, and has a correction
for the relative SMT N/S barrel alignment.
* The SMT Lorentz corrections are enabled, and use the run configuration
database to get the correct field polarity.
* Post-shutdown SMT pedestals are being used.
* The tracking group has supplied new paths which extrapolate SMT tracks
into the CFT, allowing for misses.
* The correct nominal beam position for post-shutdown data has been included
in track reconstruction.
* Remove hardwired PDT hit resolution, increasing number of central muons
with matched A and BC segments
* New H-matrices tuned with plate MC (no preshower).
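The fiber-by-fiber CFT thresholds read from flat files, and the automatic choice between MC and data thresholds listed for p10.15.00, can be pictured with a minimal sketch. The file layout (one "fiber_id threshold" pair per line), the default and MC threshold values, and every function name here are assumptions made for illustration only; none of them is taken from the reco code.

```python
import io

DEFAULT_THRESHOLD = 1.5  # hypothetical fallback for fibers missing from the file


def load_thresholds(flat_file):
    """Parse 'fiber_id threshold' lines; '#' starts a comment."""
    thresholds = {}
    for line in flat_file:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fiber_id, value = line.split()
        thresholds[int(fiber_id)] = float(value)
    return thresholds


def hits_above_threshold(signals, thresholds, is_mc=False, mc_threshold=1.0):
    """Return fiber ids passing the per-fiber (data) or flat (MC) cut."""
    kept = []
    for fiber_id, signal in sorted(signals.items()):
        if is_mc:
            cut = mc_threshold  # hypothetical single cut for simulated events
        else:
            cut = thresholds.get(fiber_id, DEFAULT_THRESHOLD)
        if signal >= cut:
            kept.append(fiber_id)
    return kept


# Made-up flat file content and signals, for illustration only.
flat = io.StringIO("# fiber threshold\n1 2.0\n2 0.5\n")
table = load_thresholds(flat)
signals = {1: 1.2, 2: 0.9, 3: 1.6}
data_hits = hits_above_threshold(signals, table)
mc_hits = hits_above_threshold(signals, table, is_mc=True)
```

With these made-up numbers the data path keeps fibers 2 and 3, while the MC path keeps fibers 1 and 3, which shows why applying the wrong set of thresholds changes the occupancy.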
L3

FNAL Farms
Farm Report for 27 Feb, 2002
Processing with p10.14.01 D0reco is complete through the beginning of
February.
The recotest run of p10.15.00 was completed successfully on Feb 26th. We
will shift normal processing over to p10.15.00 as soon as Harry gives us the
OK.
New 1GHz farm nodes were turned over to us on Feb 25th. We are still running
down setup and configuration problems on these nodes. We expect them to be
in production by Mar 1st.
We are now running with access to both STKen and the new Mezzanine D0 robot.
So far this looks very good, with no known problems attributable to these
systems.
Evaluation of hardware for the next farm purchase is proceeding. Vendors
have been notified of requirements and evaluation units should be arriving
next week. Benchmark sets for the evaluation are being put together. D0reco
will be run as part of the benchmark tests.
Graphics

Infrastructure
Infrastructure/Code Management Status: Feb 2002
We've more or less recovered from Christmas.
Since the last report, we have ceased doing routine "t" builds on RH6.2. We
are still doing RH6.2 production builds since the farms still need them. We
have also ceased doing maxopt "t" builds due to lack of requests and lack of
build CPU. So we are now routinely building "t" releases on Irix and Linux
RH7.1, debug only (mostly), but Irix, RH6.2 and RH7.1, debug and maxopt
versions, for production.
Release     Frozen
t02.02.00   Jan 29
t02.03.00   Feb 4
t02.04.00   Feb 9
t02.05.00   Feb 15
t02.06.00   Feb 25
We are making fairly good progress on production releases. The last couple of
p10 releases have been real "pass" releases in that they are bug and problem
fix releases only, no development. The p10 releases were abandoned by the
trigger groups as being too old and not worth updating. So p10 is Monte Carlo +
D0Reco/reco_analyze only at this point. The trigger people have switched all
effort to p11. It is p11.01.01 that is running on Level 3 at the present time.
The Reco people are just switching their major effort to p11. We will have
close to 100 modified packages for p11.02.00, starting today. This is a *huge*
number for a production release, but indicates the amount of effort that was
going into p10 and getting the code working on real data. This sort of thing
should not repeat.
Production Pass Releases:
Release     Frozen
p10.14.00   Feb 4
p10.14.01   Feb 12
p10.15.00   Feb 26 (maybe)
p11.01.00   Feb 13
p11.01.01   Feb 25
OSF: last one was onl01.67.00
Build Resources:
Build Machines
The major resource problem we are having these days is lack of machines.
Builds on d0mino have taken as long as two days (44:20 hours) and almost 24
hours on the other two build machines. This is with at least one other build
occurring at the same time. This has gone up from about 8 hours on d0mino about
a year ago when we first installed the parallel build system. The growth in
time has been fairly slow but steady. It cannot be attributed to an increase in
packages or anything like that. But it does pretty much parallel the use of
d0mino. However, there should be plenty of resources available on d0mino. So
this is not well understood. The build times on the Linux boxes are pretty much
what we expect. D0lomite (RH7.1, 750MHz, 8 processor) is only a little faster
than d0lxbld4 (RH6.2, 500MHz, 4 processor) because it's using disks NFS-mounted
from d02ka. This costs about a factor of two. However, we would like to keep
this arrangement because in this configuration the builds are instantly
available to the entire Linux world at FNAL, ClueD0 in particular.
Memory
No problem there.
Disks
We recently filled up the d0cvs disk. We quickly found some files to delete
(core files that people had committed, etc.) and got back on the air. During
today's (2/26) shutdown, the repository was moved to a different, larger disk.
We are still (I am told) able to make the rotating nightly copies, with copies
from the last 4 nights available at all times. So we should be good for now.
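The rotating nightly copies described above (only the last 4 nights kept) follow a simple fixed-length rotation. The sketch below is a minimal illustration of that bookkeeping, not the actual backup procedure; the class name and the copy labels are hypothetical, and labels stand in for real repository copies.

```python
from collections import deque


class NightlyRotation:
    """Keep only the most recent `keep` nightly copies."""

    def __init__(self, keep=4):
        # A bounded deque discards the oldest entry when a new one is appended.
        self.copies = deque(maxlen=keep)

    def add_copy(self, label):
        """Record tonight's copy; return the label rotated out, or None."""
        dropped = None
        if len(self.copies) == self.copies.maxlen:
            dropped = self.copies[0]
        self.copies.append(label)
        return dropped

    def available(self):
        """Labels of the copies currently kept, oldest first."""
        return list(self.copies)


# Six consecutive nights with a 4-night window (hypothetical labels).
rotation = NightlyRotation(keep=4)
dropped = [rotation.add_copy(f"cvs-2002-02-{day:02d}") for day in range(22, 28)]
```

After six nights the two oldest copies have been rotated out and the four most recent remain available.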
On /d0dist/dist/ we have:
d0mino    278GB + 36GB for tarfiles
d0lxbld4  211GB  RH6.1 served to d0lxbld1/3
d0lomite  uses the disk on d02ka but still has 215GB locally if we need it
d02ka     275GB  RH7.1 builds served to clued0 etc
Using the nfs disks slows the d0lomite builds a lot (30% at least) but it makes
the builds available immediately. So far, this is judged to be more important
than speed.

Online:
Online Status Report
2/26/02 S.Fuess
- Improvements in the reliability of the VBDi to VRC link have greatly
  increased overall DAQ rate into Level 3, now somewhere around 100 Hz
- L3 output has been restricted to ~25 Hz to match current Offline
  capabilities, and additionally to avoid running up against Online host
  machine limitations
- Setting up 2 Linux boxes with Gb adapters to act as home for Collector and
  Distributor processes, thereby giving more resources to Data Logger and
  hence providing 50 Hz capability. But there are problems with Linux Gb
  drivers holding this up.
- SBCs are operating in several CFT crates as "virtual VRCs"
- Recent tests have VRCs operating as "virtual SBCs". This is more the
  migration path where groups of crates/VRCs will be gradually replaced with
  SBCs. This configuration requires the functioning of all the pieces of the
  new DAQ.
- Attempting to purchase more Linux systems for Control Room and monitoring,
  but finding appropriate systems is a slow process.
- Progress being made on defining EXAMINE common needs and setting goals for
  enhancements (trigger selections, begin/end run actions, database
  connections).
- Discussions on needs for more run/detector configuration information
Remote Farms
Remote Analysis:
SAM Data Handling System
SamStatusReport20011212
SamStatusReport20020226
SAM-core
The v4 sam is quite stable after a few minor problems with the initial
installation on d0mino. There have been several major technical achievements
in the last month including 1) setting up remote stations with “parasitic”
stagers running on d0mino, and 2) getting sam to operate in the distributed
analysis mode. Both of these employ the distributed caching now built into the
sam station. Using parasitic stagers running on d0mino, remote stations are
capable of accessing files in enstore, and moving them through the d0mino cache
to the remote site. Chris has been working to debug the distributed station on
clued0 and has a working system. There is still work needed to achieve the
final desired behavior and further cooperation from the clued0 administrators
is required.
There have been several problems also. Unfortunately, the much anticipated
“fast station revival” was a failure and recovery of the station is slower
than ever. This has to be resolved and will take more time from Matt, Igor and
Andrew to figure out. Sinisa has spent considerable time working on a fix for
a problem that causes common IDLs from the rcp, sam manager, and calibration
packages to clash in the build. This needs consensus from Alan, Paul and Mark,
and should be finished in the next week.
The ability to move data from enstore to remote sites enables us to expand the
number of remote stations. Thanks to the efforts of Lauri and Chris to compile
clear and concise instructions, and offer hands-on assistance at the d0RACE
workshop, we now have more than 10 new remote stations, in addition to the
dozen or more in the already existing network. This is showing up on the
network monitors and we are watching for possible problems it may create on
d0mino. We do not plan to register additional stations until we digest this
load. There will probably be a major upgrade to sam around the end of June that
will incorporate many of the research efforts coming out of sam-grid. Sometime
after that we will expand the number of stations again significantly. Many of
the sites in the current station list are participating in the D0 sam testbed
project. This is both an official grid outgrowth and a vehicle to involve
remote sites and the FNAL networking group. This needs more help to organize
and steer the effort. Testbed sites include the following:
* Fermilab (Batavia, IL)
* Imperial College (London, UK)
* IN2P3 (Lyon)
* Lancaster (UK)
* Munich (DE)
* NIKHEF (Amsterdam, NL)
* Prague (CR)
* Wuppertal (DE)
* Boston U. (Boston, MA)
* University of Arizona (AZ)
* U. Texas, Arlington (Arlington, TX)
* U. Oklahoma, Langston (Langston, OK)
* Indiana U. (Bloomington, IN)
* Louisiana Tech (Ruston, LA)
* University of Kansas (Lawrence, KS)
* Michigan State University (East Lansing, MI)
Work is being organized to understand the networks, and to begin testing more
extensive SAM station deployment and operation in a systematic fashion.
Our new D0 sam team member, Andrew Baranovski, is primarily learning the design
and code of the sam station cache manager and file storage server. There are a
few minor fixes and feature additions he will begin to implement soon to get
his feet wet. The major project is to merge the function of the station cache
management and the file storage server together, and implement data routing in
its true final form within the system. This will allow us to set up static
routes for data transfers among stations and provide the control required to
set up the remote data center hierarchy we desire. Also, additional work is
needed to resolve some problems that have caused quasi-deadlock conditions on
the farm, and may be issues on clued0.
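As a rough picture of what static routes among stations in a data center hierarchy could look like, here is a minimal sketch. It is not the sam station design: the routing-table layout, the station names, and both functions are hypothetical and chosen only to illustrate the idea of a fixed next hop per station.

```python
# Hypothetical static routing table: station -> next hop toward the central store.
STATIC_ROUTES = {
    "remote-site": "regional-center",
    "regional-center": "central-analysis",
    "central-analysis": None,  # end of the chain (stages files from storage)
}


def route_to_root(station, routes):
    """Follow next-hop entries from `station` to the end of the chain."""
    path = [station]
    seen = {station}
    while routes[path[-1]] is not None:
        nxt = routes[path[-1]]
        if nxt in seen:
            # A static table can be misconfigured; refuse to loop forever.
            raise ValueError(f"routing loop at {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path


def delivery_path(station, routes):
    """Files flow down the hierarchy: the reverse of the route to the root."""
    return list(reversed(route_to_root(station, routes)))


path = delivery_path("remote-site", STATIC_ROUTES)
```

Keeping the table static makes the transfer path for any station predictable, which is the kind of control the hierarchy requires; the loop check guards against a bad table entry.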
There are some improvements to the shift operation, including a new tracking
tool and some help for scheduling shifts. Lauri has built a web tool that
allows the shifter for each day/zone to complete a simple “button” checklist.
This information is put into our Oracle database and preserved. Later we can
chart the history for each category monitored and observe which are the more
problematic. Don Coppage (KU) has agreed to help manage the shift list for a
while and this will relieve me of some of this responsibility.
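Charting the checklist history per category amounts to counting the "not OK" entries for each monitored category. The sketch below assumes a record layout of (date, zone, category, ok); that layout, the category names, and the sample entries are all invented for illustration and do not describe the real database schema.

```python
from collections import Counter

# Hypothetical checklist records: (date, zone, category, ok)
records = [
    ("2002-02-24", "US", "station", False),
    ("2002-02-24", "EU", "enstore", True),
    ("2002-02-25", "US", "station", False),
    ("2002-02-25", "EU", "network", False),
    ("2002-02-26", "US", "enstore", True),
]


def problem_counts(rows):
    """Count failed checks per category, most problematic first."""
    counts = Counter()
    for _date, _zone, category, ok in rows:
        if not ok:
            counts[category] += 1
    return counts.most_common()


ranking = problem_counts(records)
```

In this made-up sample the "station" category fails twice and "network" once, so "station" would head the list of problematic categories.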
SAM-tools
The tools team is not yet holding regular meetings. Wyatt is collecting
information, and so far has talked with Matt Vranicar and Michael Begel. We
need to resolve the personnel gap, since Carmenita will be unavailable for a
few weeks. Wyatt expects to have a preliminary task list by the end of this
week.
SAM-grid
The Grid team is working to understand Globus tools, especially GridFTP, GSI,
and MDS. Igor is studying EUG projects and in particular the WP1 job
submission and resource broker. An adapter for sam to condor has been built and
is being debugged. It is now used at Imperial College and Wuppertal. They are
evaluating condor-G and GRAM as part of the job submission for the summer
milestones. Sinisa is working on monitoring and information services, starting
with the evaluation of MDS, and he has formed some preliminary opinions.
Gabriele is evaluating GridFTP, using certificates with GSI. He has been
successful using DOE Science Grid certificates of authentication to perform
file transfers between FNAL and UK sites (a feat which drew applause at the
Global Grid Forum…sad but true!). They have collected use cases for the job
control language and job submission. John Weigand and Gabriele are providing
SAM data file use stats for UC Students, and database usage numbers for Koen
Holtman to use in the CMS Grid requirements doc. Iain and Dave Evans are now
testing the MC request system to specify and track MC processing tasks at
remote processing sites.
We are building and coordinating our D0GRID collaboration that currently
includes FNAL, Imperial College, Lancaster, Prague, NIKHEF, UTA. There is
interest and contributions from many other US and European sites as well.
Phone and video conferences are held at least bi-weekly. Face-to-face meetings
are imperative and we take advantage of every opportunity we can to hold them.
Gabriele Garzoglio started Nov. last year to work full time on ppdg. Andrew
will work with the SAM Core team to allow more ppdg-related time for Sinisa and
Igor. New funding in the GridPP (UK) project will provide additional manpower
for D0GRID at IC and Lancaster. CDF is evaluating using SAM for data handling
and Sinisa has spent quite a bit of time making the infrastructure
(sam_bootstrap, sam_config, sam_db_server, and others) configurable to work for
anyone. Lauri has contributed her ups/upd and other experience to this.
Databases (Drawn largely from Ruth’s notes for the last meeting)
1. Database server infrastructure reworking - Steve, Jim, Herb
They have done a lot of design work and written a short paper on the changes
they are hoping to make. This will impact d0om. Some recoding of dbcore is
needed to separate out classes with specific functionality, and they will link
through a connection management class. We need to understand the performance
and impact on the applications.
2. Calibration databases
2.1 SMT
The SMT offline database tables are in production and have been
filled by Taka. The client code is in Reco and ready for production in the p11
release.
2.2 Muon
This is very close to being ready and should be in the p11 reco release.
2.3 CFT/CPS issues
Eric Meyers is looking at the pros and cons of the CFT/CPS merge for the
offline. Eric is writing an update to Jeremy's procedure which will move the
CPS and should work for both. Zhong-Min and Eric have something that works for
this.
2.4 Calorimeter status - Ursula
Ursula is making lots of progress and may get to integration in the offline
this week. Will adapt online/offline onl_cal_transfer. They are anticipating
30 Gbytes/year for offline and this is considered fine. Updates for the
constants will occur a couple of times a week and the whole tree will be put
into the database for each calibration set. It took 15 minutes to retrieve a
complete set of constants, and if multiple runs are needed for a data set there
may be performance issues.
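To make the performance concern concrete, the sketch below estimates the total retrieval time for a data set spanning several runs, using the 15 minutes per complete set quoted above. The run numbers, the calibration set labels, and the idea of fetching each distinct set only once are assumptions for illustration; the notes do not say how the client will actually behave.

```python
MINUTES_PER_SET = 15  # retrieval time quoted for one complete set of constants


def retrieval_minutes(run_to_calib_set, cache_sets=True):
    """Estimate total retrieval time for the runs in a data set.

    With cache_sets=True each distinct calibration set is fetched once
    (an assumed optimization); otherwise every run triggers a fetch.
    """
    if cache_sets:
        n_fetches = len(set(run_to_calib_set.values()))
    else:
        n_fetches = len(run_to_calib_set)
    return n_fetches * MINUTES_PER_SET


# Hypothetical data set: six runs covered by two calibration sets.
runs = {
    101: "calib-A", 102: "calib-A", 103: "calib-A",
    104: "calib-B", 105: "calib-B", 106: "calib-B",
}
naive = retrieval_minutes(runs, cache_sets=False)
cached = retrieval_minutes(runs, cache_sets=True)
```

For this invented example, one fetch per run costs 90 minutes while one fetch per distinct calibration set costs 30 minutes, so the number of distinct sets touched, not the number of runs, is what would drive the cost if sets can be reused.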
3. Trigger database
Results of meeting with L2, review/requirements gathering - Elizabeth, others.
Connections between online and offline databases do not allow for
l2-software-version. Major versions are made offline release in time for a new
file; Trigger DB will implement the Major Version number - associates with a
set of tools, and filters etc. Dynamic changes run to run made in the online
environment and COOR would support them through a resource file. If they want
both a major/minor version downloaded then Scott would handle this through a
Resource File. Ruth will arrange a follow up meeting with L2. Ron Lipton is
expected to gather the group to coordinate discussion of requirements and
policy and procedure.
4. Luminosity database status - Jeremy
Jeremy finished coding the transfer from luminosity files to the database.
Space reports have been given to Diana and Anil but these are wrong - they are
much too large. Ready to go into production in offline and they need to
allocate 10-30 GB. Offline needs documenting. Next step is to build an offline
dbserver and application interface for the information. Jeremy will do a
prototype. Need more help and they will ask the Luminosity group for a person
for this. It is important to start on Runs Summary database updates for Data
Quality. Propose to leave the offline application interface for luminosity and
start on this pending a d0 person to do the client side of the luminosity
offline application.
Databases
D0 Databases Feb Monthly Report
for comment and question...
Production Applications for Reco Requirements - the runs summary database is
being used by Reco in production. This means database downtimes affect all reco
developers unless they configure the RCP parameters to not use the database.
Calibration Database Applications - Progress is being made on the redesign
of the Calorimeter tables and changes to the applications. Ursula has a
timetable for completing a first pass by the end of March. The merging of
CPS/CFT tables in the online is leading to confusion in the data transfer
and offline database application. This is being sorted out by the
application developers, but at present I do not see a timetable for
production release of CPS/CFT. SMT is in production and has been delivered
to the production release manager.
Luminosity Database - much progress has been made. Jeremy Simmons has
contributed significantly to the success of this project. The size of the
data stored has been much reduced. The application is in production in the
Online database and will be in production for Offline and Data transfer
within a few weeks.
Trigger database - a review of this application was requested by the CPB.
Progress towards this has included a plenary session at the collaboration
meeting, and 2 meetings with L2 and Ron where a short term use and longer term
plans were agreed on. Roger is writing up the short term usage, which
requires only a small change to the output of the trigger database. Activity
is starting to a) define requirements from the trigger groups, b) address the
overhead of changing the trigger list version - there is a lot of
validation and generation required on any change, and c) understand the scope
and charge of the review. In the meantime work is continuing and extra help
would still be useful. Ron Lipton, the new trigger commissioning tzar, will
be helping with the organization of the requirements/review.
Runs Summary database - this is in use by Reco and analysis programs. Work
has started to understand the specifics of the upgrade that is planned once
the luminosity application is in production in online and offline.
Database server infrastructure - the plan for the remaining work for the
database server infrastructure has been updated and work is proceeding. The
fix for multi-threading is in production for all servers. The current work
will include the design for the proxy server and caching support needed for
remote analysis.
Online databases - Vladimir is working with the online commissioning group
on requirements and design for the support of online quality and the capture
of EPICS parameters.
Database Administration and infrastructure - the databases were down several
times for the installation of a new disk set. We are doing a root cause
analysis of why the process was so difficult. Otherwise the databases were
stable. Work is ongoing to understand the best scenarios for backing up and
restoring the offline production database as it gets larger. Database
servers have been established on a new Linux node which will allow them to
be moved from D0mino in the near future. A failover plan is being written.
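One common shape for client-side failover is to try an ordered list of servers and use the first that answers. The sketch below illustrates only that general pattern; it is not the failover plan being written, and the exception class, the server stubs, and the reply fields are all hypothetical.

```python
class ServerDown(Exception):
    """Raised by a (hypothetical) server stub that cannot be reached."""


def query_with_failover(servers, request):
    """Try each (name, handler) in order; return (name, reply) from the first success."""
    errors = []
    for name, handler in servers:
        try:
            return name, handler(request)
        except ServerDown as exc:
            errors.append(f"{name}: {exc}")
    raise ServerDown("all servers failed: " + "; ".join(errors))


def primary(request):
    # Stand-in for a server on a host that is in a scheduled downtime.
    raise ServerDown("host in scheduled downtime")


def backup(request):
    # Stand-in for a dedicated backup server; the reply content is invented.
    return {"run": request, "field_polarity": "+"}


used, reply = query_with_failover([("primary", primary), ("backup", backup)], 12345)
```

With this ordering a job started during a primary-host downtime would still get its answer from the backup, and only fails when every listed server is unreachable.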
Comments from Harry Melanson:
Hi Ruth,
Some comments from a "consumer".
Reco (p10.14.xx) has been running on the "official real data reconstruction
farms" (i.e. "the FNAL farms") since early February. This version has been
accessing the run configuration database successfully, with NO (!) reported
problems. Being able to access that db allows the reconstruction to
correctly apply the SMT Lorentz corrections, which significantly improves
the position resolution of its 3-dimensional hits. This has been a
long-standing difficulty with previous reconstruction versions, and it is
significant that we don't have to worry about this again. It is also the
first deployment of a database in "reconstruction production mode", and I
personally see this as a big success. THANKS to all involved.
The only problem we've experienced since p10.14.xx was deployed happened
"today" (corresponding to a "scheduled" d0mino downtime). Since reco now
requires the run config db server to be up, when d0mino is unavailable, new
farm jobs can't start. In addition, no user (including me) can start up a
reco job. As you identified, moving servers to dedicated (with backup)
servers is the solution. From reco's perspective, this is important.
The current p11 status for reco has us being able to reproduce p10 results
within three weeks (with p11.03.00). This means using "flat files" for
dealing with calibration constants for SMT and CFT. My plan is to then
start to initiate real db connections for detector calibration constants
(with p11.04.00 and beyond). You indicate that SMT is now ready, but that
there are issues with CFT. Feedback to you is that if CFT can become ready
within 3-4 weeks, that will be compatible with reco in p11. This is very
important, if we can pull it off. Although I know that CPS and CFT are
coupled, it is much more important to get CFT "online" (well, "offline")
than CPS.
The calorimeter calibration application is also important. If it can be
integrated into the reconstruction program within the next two months, in
the development build, then I think we are "on pace". This means that
"first pass by the end of March" must happen.
Regards,

Simulation