Computing and Software Status

May, 2002

Central Farms

DØGrid

Remote Farms

RECO

Online

L3

Tools

Simulation

Databases

Infrastructure

 

Farm Status

Heidi Schellman, 6/10/02

 

Configuration

Hardware:

Farms currently consist of 20 staging nodes fnd01-20

20 500 MHz nodes

50 750 MHz nodes

32 1000 MHz nodes

for a total of 204 processors available for reconstruction.
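
The total is consistent with each worker node carrying two processors. That is an assumption, not something stated above, and the 20 staging nodes are not counted. A quick check of the arithmetic:

```python
# Worker-node counts by clock speed (MHz), taken from the list above.
worker_nodes = {500: 20, 750: 50, 1000: 32}

# Assumption: every worker node is a dual-processor machine.
CPUS_PER_NODE = 2

total_nodes = sum(worker_nodes.values())
total_cpus = total_nodes * CPUS_PER_NODE
print(total_nodes, total_cpus)  # 102 204
```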

We have had many problems with the fiber channel disk, probably with the controllers rather than the disks themselves.

OS:

We upgraded from 6.1 to 7.1 in late March with no problems until an optimizer bug was found in the muon code last week.

 

Executables:

We ran p10.14.01 over the sample taken from Nov to Feb 1 (129190-145043).

We ran p10.15.00-02 over data taken from Feb 2 to Jun 3 (145050-).

We plan to run p11.08 or 11.09 on the recoS stream from whenever the CFT got fixed to now.

All but a few million events are done, but we had to back out of p10.15.02 maxopt because of an optimizer problem. We are now running p10.15.02 debug, but about 10M events were processed with the bug before it was caught.

 

Control scripts:

The file merge scripts were upgraded in June. They are still being tested for p11.07 and are not intended for p10.15.

 

Monitoring:

Marco Verzocchi has written an excellent monitoring page at: http://www-d0.fnal.gov/phys_id/luminosity/data_access/processing_status/ which helps monitor processing and merging.

 

Future upgrades

We are working on making the Linux merge nodes capable of serving as the primary output holding area. More nodes will be delivered in August/September; bids are now in progress. We are also working on using the run time environment instead of our customized environment.

 


 

DØ Grid

In the month of May, 2002, we came closer than ever to producing real work in the area of job and information management for the d0grid:

 

1) We have worked out a detailed plan for the near-term deliverables of the project. In the absence of clarity, we had to make reasonable assumptions about the manpower available and the participation of our PPDG collaborators. I arrived at a plan assuming 1.4 +- 0.1 FTE plus 3 students at FNAL, a D0 collaborator at IC-UK and a few people at UTA. The plan is posted on the d0grid page.

 

2) We gained a deeper understanding of the applicability of the standard grid monitoring facilities to the needs of d0 meta-computing. (This in part enabled the previous item.) Specifically, the MDS software from Globus appears to be useful for describing resources such as SAM stations and MC-capable farms, as well as for monitoring running jobs. We have understood how to write providers of such information to the point that we can train the students. We have further understood that MDS is not well suited to the historical display of finished jobs, so one will need a logging service very similar to that in SAM.

 

3) UTA people have already made progress in making a toy display of the MC jobs by interfacing MCFarm with the standard monitoring tools. Work remains to extend such displays to all kinds of D0 jobs.

 

4) UTA people continued studies of the grid tools, as coordinated in the d0grid meetings. The primary example is the DAGMan tool allowing reliable execution of complex jobs with dependencies. It is too early however to judge how this will be incorporated into the full job management.

 

5) Rod Walker at ICL-UK is still working on the full-fledged inclusion of Condor as a supported batch system in SAM. He has needed help with the technical issues of programming C++ with CORBA and such. The work on including GridFTP also proceeds slowly, an issue I will soon be addressing within the group. The slowness appears to result from our not understanding exactly what D0 wants as far as GSI is concerned, as well as from the unclear status of a grid authorization service (CAS).

 

6) We have welcomed the three students, set up their accounts and introduced them to d0 data processing with SAM, using Condor and Globus. They are expected to start useful work in early June, starting with small but very important items like breaking the grid configuration barriers between UTA and FNAL.

 

7) The Condor team, a PPDG collaborator of D0, had not delivered the changes to Condor that we requested as of the end of May. Hopefully, they will arrive very soon.

 

8) d0grid is now including CDF in the grid area. This has the impact of increased coordination. Just how much remains to be seen, but we hope that the benefits will outweigh the expenses.
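
As a rough illustration of what item 4 means by executing jobs with dependencies, the sketch below orders a hypothetical MC job chain so that every step runs only after its prerequisites. It uses Python's standard graphlib module (Python 3.9 or later) and is not DAGMan's interface; the step names and the dependency table are assumptions made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical MC job chain: each key lists the steps it depends on.
dependencies = {
    "generate": set(),
    "d0gstar": {"generate"},
    "d0sim": {"d0gstar"},
    "d0reco": {"d0sim"},
    "recoanalyze": {"d0reco"},
    "store_in_sam": {"d0reco", "recoanalyze"},
}


def run_order(deps):
    """Return one valid order in which every step follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real job manager would also have to handle retries and partial failures, which this ordering step does not address.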

 


Remote Farms
Iain Bertram, June 6, 2002

Software: mcp10

18.5M reco events in SAM from phase mcp10 and reco certification samples.
(See http://www-d0.fnal.gov/computing/mcprod/Stats_2002_06.htm for details)

The current release of software on the farms is as follows:
  Generators:     p10.15.01 (not fully functional)
  D0gstar,D0sim,d0reco,recoanalyze:    p10.15.02
  MagField:       v00-01-00

Current requests are at:
http://www-d0.fnal.gov/computing/mcprod/Requests/Requests.html

p10.15.03 seems to have fixed the pythia problems.

The Request System is undergoing testing and is nearly complete (report).

 


RECO Status Report 

Harry Melanson, June 3, 2002

 

Current official production version: p10.15.02

Latest production version: p11.08.00

p11.07.00 was scheduled to become the official production version on May 21, 2002. This milestone was not met. Two major problems contributed to not meeting the milestone: 1) a bug in the calorimeter non-linearity correction software causing all calorimeter clusters to have invalid energies, 2) incomplete testing and documentation of RECO. The calorimeter NLC bug was fixed in the final p11.07.00 build, and we are currently running p11.07.00 on the farms processing various test samples. So far, the performance looks as expected. However, there appear to be more crashes in RECO / RECO_ANALYZE than in p10. These are being investigated and patches are being applied in p11.08.00.

 

p11.08.00 is scheduled to finish on June 5. (Since there are two power outages scheduled for this week, the exact date is a little uncertain.) p11.08.00 contains a few bug fixes (e.g. floating point exception fixes).

Algorithms and Object ID groups are in the process of documenting and signing off on p11. It should be available for deployment to the farms by the end of the week.

 

Plans for p12 are being discussed now. Each Algorithms and ID group has been asked to supply a list of goals for p12. In addition, a list of specific tasks to accomplish those goals, the personnel assigned to those tasks, their level of effort available (%FTE) and when the tasks should be completed is being compiled. This exercise will be done for p13 as well. The current RECO schedule is

It is hoped that with the development of a task-oriented schedule, RECO milestones will be more reliably met.

 


Level 3

Dan Claes, May 22, 2002

 

Online Monitoring

Online rejection factors for the new (global_CalMuon5.0) electron filters are very close to predicted values (per Ulla Blumenschein). Jet filters provide about half the expected rejection, perfectly adequate at current rates.  

 

We need better (maxopt) timings of all current tools, and this could be facilitated by access to the online statmanager. Manpower for this is yet to be identified, but Moacyr Souza is ready to work with someone on this.

 

L3 muons...............Martijn Mulders, Martin Wegner, Christophe Clement

Both the dynamic unpacker and muon_geometry memory-leak were tested and released with p11.04. A large number of offline RECO code changes (including major mods to segment finding) accompanied the release. As a consequence, MDT code crashed the initial build, and a huge memory leak was introduced. Further complications included about 50 events exceeding the 30 second online time-out. New Muon Code Management procedures are now in place that include running L3 memory and timing studies on standard data samples before any muon code is sanctioned for release.

 

Modified local track-finding parameters (tuned number of iterations and propagator step size) improve tool timing, with negligible loss in performance. Running p11.04.01 (maxopt) with the tuned parameters on a Pentium III 1 GHz machine reduced the average time per event from 199 msec to 18 msec.

All known fixes together with the newly tuned parameters are part of p11.06.00.

 

Tracking Geometry and Unpacking.........................Robert Illingworth

L3GeometryManagement now allows instantiating the full offline geometry, substantially improving the relative alignment of the subdetectors. Studies (Daniel Whiteson) demonstrated better performance (number of SMT hits, impact parameter, etc.) than any of the simple RCP-based geometry options. L3SmtUnpack now offers fully dynamic crate/VRB configuration, parameterised pedestals and noisy strip killing. Timing real data on a 1 GHz Linux machine gave 3.9+/-1.1 msec/event.

 

L3TGlobalTracker...........................................Daniel Whiteson

Previously, stereo tracking was limited by our CFT coverage. Now SMT-only stereo tracking on CFT+SMT axial tracks is invoked whenever CFT stereo fails.

 

L3Ftrack, a standalone track filter, will select on n tracks above a Pt threshold.

 

L3 Primary Vertexing......................Per Jonnson, Christopher Barnes

The track-based primary vertex tool is being exercised using both CFT-only and global tracks. The algorithm finds the average Z of (>2.0 GeV/c) tracks with 8 stereo and 6 axial hits (and chi^2 per dof <5). Events whose tracks all fail this requirement simply enter all tracks into a histogram (in real data this occurs 28% of the time). Monte Carlo studies suggest resolutions of ~500 micron are possible, though current efficiency on real data is about 40%.
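
A minimal sketch of the selection and averaging described above follows. It assumes that the hit counts are minimum requirements and that the fallback histogram is resolved by taking the centre of its most populated bin; neither detail is stated in the report. The Track fields, the bin width and the function names are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Track:
    z: float         # z position of the track near the beam line (cm)
    pt: float        # transverse momentum (GeV/c)
    n_stereo: int    # number of stereo hits
    n_axial: int     # number of axial hits
    chi2_dof: float  # chi^2 per degree of freedom


def passes(track, min_pt=2.0, min_stereo=8, min_axial=6, max_chi2_dof=5.0):
    # Assumption: the quoted hit counts are lower bounds.
    return (track.pt > min_pt
            and track.n_stereo >= min_stereo
            and track.n_axial >= min_axial
            and track.chi2_dof < max_chi2_dof)


def vertex_z(tracks, bin_width=1.0):
    """Average z of selected tracks, else a histogram fallback over all tracks."""
    if not tracks:
        return None
    good = [t.z for t in tracks if passes(t)]
    if good:
        return mean(good)
    # Fallback (assumed): histogram every track and take the fullest bin's centre.
    counts = {}
    for t in tracks:
        b = math.floor(t.z / bin_width)
        counts[b] = counts.get(b, 0) + 1
    best = max(counts, key=counts.get)
    return (best + 0.5) * bin_width
```

Relaxing the stereo requirement then amounts to calling passes with a smaller min_stereo value.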

 

Given the new flexibility of global tracking (see above), tests will be run with the cut on the number of stereo tracks reduced from 8 to 3.

 

Calorimeter Unpacking..........................................Marumi Kado

Individual channel non-linearity and gain corrections introduce a small additional CPU demand at construction when uploading coefficients, but no visible burden when running (Robert Zitoun). The current package implementation feeds data in through flat files, eventually to be handled by the Calibration manager.

 

Trigger rate studies (Tibor Kurca) of the effects of calorimeter hot cells are being run. A L3 version of NADA (Gregorio Bernardi) proposes running an algorithm that flags hot cell candidates by a 3 GeV threshold (optimized for timing) and enforcing a dynamic threshold of max(100MeV,2% E_hotcell) on neighboring cells.
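
The thresholds quoted above can be sketched as follows. The sketch assumes that a candidate above 3 GeV is flagged as hot only when none of its neighbours reaches the dynamic threshold, which is an interpretation and not something the report states. The cell identifiers, the neighbour map and the function names are hypothetical.

```python
def neighbor_threshold(e_cell_gev):
    """Dynamic neighbour threshold max(100 MeV, 2% of E_hotcell), in GeV."""
    return max(0.100, 0.02 * e_cell_gev)


def find_hot_cells(energies, neighbors, candidate_threshold=3.0):
    """Return the cells flagged as hot.

    energies:  dict mapping cell id -> energy in GeV
    neighbors: dict mapping cell id -> iterable of neighbouring cell ids
    """
    hot = []
    for cell, energy in energies.items():
        if energy <= candidate_threshold:
            continue  # not a hot-cell candidate
        limit = neighbor_threshold(energy)
        # Assumption: a candidate with no neighbour at or above the limit is hot.
        if all(energies.get(n, 0.0) < limit for n in neighbors.get(cell, ())):
            hot.append(cell)
    return hot


energies = {"a": 10.0, "b": 0.05, "c": 0.05, "d": 6.0, "e": 1.0}
neighbors = {"a": ["b", "c"], "d": ["e"]}
print(find_hot_cells(energies, neighbors))  # ['a']
```

Cell "d" is not flagged because its neighbour "e" carries 1 GeV, well above the 0.12 GeV limit for a 6 GeV candidate.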

 


Infrastructure/Code Management Status

Alan Jonckheere, Mar, Apr 2002

 

sections: Code Management, Releases, Resources

 

Code Management

At the last report we had ceased doing routine "t" builds on RH6.2. We are still doing RH6.2 production builds since the farms might still need them. We still aren't doing maxopt "t" builds due to lack of requests and lack of build CPU. So we now routinely build "t" releases on Irix and Linux RH7.1, debug only (mostly), while production releases are built on Irix, RH6.2 and RH7.1 in both debug and maxopt versions.

 

"t" releases:

We had just frozen t02.06.00 Feb 25. By 4/30 we had frozen t02.15.00. The "t" releases are suffering due to lack of attention. Most of that effort is going into the production releases. It is unclear how useful the "t" releases are for the users. Presumably they are useful since we don't hear a lot of complaints. But then problems can go a couple of weeks before being discovered. Often that happens when we delete the oldest release, forcing people to use a newer one which has a problem. Unfortunately by that time it's impossible to go backward, so there are *no* usable releases until the problem can be fixed.

 

"p" releases:

In the last report we had just finished p10.15.01 and were just starting p11.02.00; p11.01.01 was running on the farms. Since then we have cut p10.15.02 (first week in May, so not really in this report) with a very minor change to beam_tilt to *not* abort reco if it finds no SMT tracks in the first 200 events. Finding no SMT tracks is a common occurrence in special runs, and beam_tilt is only needed to find the beam position for feedback to the accelerator.

 

We had just started p11.02.00. Since then we have done a new p11 release about every two weeks. By the end of April we had frozen p11.06.00 and testing was well underway. It is hoped that p11.07.00 will be the final one in this series, but we shall see.

 

Build Resources:

Build Machines

The major resource problem we are having these days is a lack of machines. Builds on d0mino have taken as long as two days (44:20 hours), and almost 24 hours on the other two build machines. This is with at least one other build occurring at the same time. The d0mino time has gone up from about 8 hours about a year ago, when we first installed the parallel build system. The growth in time has been fairly slow but steady. It cannot be attributed to an increase in packages or anything like that, but it does roughly parallel the use of d0mino. However, there should be plenty of resources available on d0mino, so this is not well understood. The build times on the Linux boxes are pretty much what we expect. D0lomite (RH7.1, 750MHz, 8 processor) is only a little faster than d0lxbld4 (RH6.2, 500MHz, 4 processor) because it is using disks NFS-mounted from d02ka. This costs about a factor of two. However, we would like to keep this arrangement because in this configuration the builds are instantly available to the entire Linux world at FNAL, ClueD0 in particular.

 

The above is exactly the report from Feb. One bottleneck has since been identified on d0mino, but not corrected. All NFS mounts, including those to the /usr/products and /d0usr/products disks, go over the default network interface, and all of its interrupts are handled by a single CPU. When the load is moderate, this is no problem. But all internet traffic from offsite, as well as all interactive traffic, also goes over that interface. When the MC farms are sending large amounts of data, this totally swamps the CPU handling the interrupts, which effectively kills almost all uses of d0mino. Interactive response goes to nil, as does any activity that requires accessing the products disks. Since the compiler is on that disk, there go our build times.

 

One other problem that we have run into: we simply cannot do two production releases and a "t" release in one week. d0lomite, the RH7 build machine, just cannot handle it.

 

Disks

On /d0dist/dist/ we have:  

d0mino       

205GB 3x striped and shadowed main set

137GB 2x striped secondary set

36GB for tarfiles

the main/secondary sets keep the two sorts of builds from interfering and at least one from interfering with the users

d0lxbld4 

100GB RH6.1 served to d0lxbld1 (very few releases there)

d0lomite 

262GB on d02ka but still have 215GB locally if we need it. Using the nfs disks slows the builds a lot (30% at least) but it makes the builds available immediately. So far, this is judged to be more important than speed.  

d02ka 

262GB (same disk as above) RH7.1 builds served to clued0 etc


Online

Stu Fuess, May 20, 2002


No major projects/upheavals in progress...

 


Simulation

Qizhong Li, 5/17/2002

 

D0gstar:  

No changes in D0gstar in the past month.

D0Sim:  

Changes and fixes were made in CFT digitization, which improve the CFT tracking.

D0Raw2Sim:

D0Raw2Sim was tested with the Feb. zero-bias run. SMT and CPS both found and fixed problems. We then tested the fixed code with the Feb. zero-bias run, but that run was taken at too low a luminosity. We recently requested a new zero-bias run with higher luminosity and have just started to test the code with data from this run.

D0TrigSim: (from Dugan)

L3 is now running p11.06.01 online. L2 is running a p11.06.00-based release online. The trigger simulation is working on data and MC inputs but has shown some instability. A known problem pertains to L3 analyze packages producing empty ntuple blocks. A solution has been added to p11.07.00. d0_analyze, which is scheduled to replace reco_analyze on the online farms, is working and combines reco and trigger output in a common ntuple.

PMCS:

In the last month the development of PMCS has been focused in the following areas:

Other effort:

Preparing the Simulation part of the D0 Offline documentation. Preparing the internal review (last week) on Simulation related tasks.


Databases

Ruth Pordes (5/15/02)

 

The RunsQuality Database and initial Application are being tested in development. Stefan reports he "has entered Tom's Muon good run list and the MET good run list taken from the MET web page in the first (development) version of the offline quality database (thanks to J.Simmons). The standard quality word can be access thru Run Quality Query. In addition, you can also get Tom's quality grades directly by asking for parameter name 'grade' in the run quality parameter query.

http://d0ora3.fnal.gov:8508/qualitygrabber/qualQueries.html 

Again, this is a development version. So it would help if you play around a bit and report problems/suggestions to me.".

 

Jeremy has developed a first version of the Luminosity Database offline access application and will be working with the Analysis Tools team and Luminosity group to extend this as needed for analysis. As he says: "I created a new offline lm_db website with misweb queries: http://d0db-dev.fnal.gov:8508/lm_db/ "

 

Taka and Lee organized a successful workshop on database browsing tools. As a result, Eric Myers and others at the University of Michigan will make the CFT database browsing tool more general, move it to Unix, and make it available as a template/example for other detector groups. Andre will extend his calibration browsing application so that it is also available as an example application that allows updating of the information in addition to access to the data. Both groups will look at carrot (a web interface to root), http://carrot.cern.ch, as an extra tool for histogramming and displaying data from the databases.

 

The Trigger Database application is being upgraded to allow easy modification of one tool, term or trigger, with automated modification of all the dependent terms. This is 2/3 finished. Minor upgrades are in progress in parallel by John Weigand who is working closely with Elizabeth. We have asked for a meeting with L3 to review the short term needs. 
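
The automated modification of dependent terms can be pictured as a walk over a dependency graph. The sketch below collects everything that depends, directly or indirectly, on a changed item. It is only an illustration of the idea and not the Trigger Database application's code; the item names and the dependency table are hypothetical.

```python
from collections import deque


def affected_by(changed, dependents):
    """Return all items that depend on `changed`, nearest dependents first.

    dependents: dict mapping an item -> the items that directly use it
    """
    seen = set()
    order = []
    queue = deque([changed])
    while queue:
        item = queue.popleft()
        for user in dependents.get(item, ()):
            if user not in seen:
                seen.add(user)
                order.append(user)
                queue.append(user)
    return order


# Hypothetical example: one tool feeds a term, which feeds two triggers.
dependents = {
    "tool_ele": ["term_ele15"],
    "term_ele15": ["trig_ele", "trig_ele_jet"],
    "term_jet": ["trig_ele_jet"],
}
print(affected_by("tool_ele", dependents))
```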

 

We assume that the issue of access to the Run Number in the online examines has been completely solved. (Stefan was going to follow this up with Pushpa)

 

Small changes to the database tables were made in support of SAM, and additional tables were made for the Runs Quality database. As the CDF version of SAM is deployed, the database schemas for the 2 experiments are being kept completely synchronized, with tests occurring in the D0 development database before any changes are made anywhere.

 

Significant progress was made on the extensions to the database server infrastructure to support in-memory and disk caching, set the stage for proxy disk servers, and support multiple connections. Jim and Steve are starting to provide versions of the new software to Jeremy to test in the luminosity application. Jeremy has updated the database server documentation and provided an easy-to-clone example and instructions. Margherita Vittone from ODS is starting to learn the database server infrastructure in order to be available to help the application owners transition between the old and the new when the time comes. She will test the new documentation.

 

Instability and errors in the Runs Configuration database application caused problems for people using Reco. This was traced partially to problems in the caching algorithms of the database server - and fixed. Slava is adding messages to the client side of the application to let users know to email d0db-support@fnal.gov if the application fails to get the correct runs information after 5 retries.
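
The client-side behaviour described above, giving up after 5 retries and pointing the user at the support address, could look roughly like the sketch below. It is written in Python purely for illustration; the real client code, its language and its function names are not given in this report, so fetch_with_retries and its parameters are hypothetical.

```python
import time

SUPPORT_ADDRESS = "d0db-support@fnal.gov"


def fetch_with_retries(fetch, max_retries=5, delay_s=0.0):
    """Call fetch() once, then retry up to max_retries times on failure.

    If every attempt fails, raise an error that tells the user whom to email.
    """
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return fetch()
        except Exception as err:  # broad on purpose: this is only a sketch
            last_error = err
            if attempt < max_retries:
                time.sleep(delay_s)
    raise RuntimeError(
        f"Could not get the runs information after {max_retries} retries; "
        f"please email {SUPPORT_ADDRESS}"
    ) from last_error
```

A caller would pass in whatever function performs the actual database request, for example fetch_with_retries(lambda: get_run_config(run_number)), where get_run_config stands in for the real query.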

 

Jundong Huang worked with the DBAs to move the Muon MDT application into production. Zhong-Min has been developing a new database application. There is concern that database applications are not being upgraded to use new code as bugs are found and features added. We must find a more robust way of promulgating these changes.

 

The months of April and May saw increased activity in the offline databases. The DBAs have started looking into optimizing those queries and database server accesses that take a significant time and/or impact other users. Logs are invaluable here. The new Linux database server machine was commissioned, and failover scripts and mechanisms were developed. These were tested by a glitch in the network, and of course it was found that 2 sets of servers were then started up. The scripts were made more robust against these kinds of transient errors. http://d0db.fnal.gov/d0dbsrv/ . It seems the negotiations between Oracle and DOE are progressing, although it is not known how this might affect the lab's licensing structure. Bill Koncelik is handling this.

 

We need to address the issue of a single naming service shared between SAM and the database applications. This will be discussed at the databases meeting on Friday.

 


Tools

Wyatt Merritt

 

I believe I am down for a status report on analysis tools. Nothing significant happened with regard to analysis tools work in April - we are now starting to hold meetings and start work, but this began last week and I will report on it in a May status report, if that's OK with you.