Computing and Software Status

July, 2002

 

 

 

RECO                Database            Analysis Tools            Simulation               Central Farm            Trigger Simulation                 Remote Farms                Graphics            SAM            Level 3

 

 

 

RECO Status

Harry Melanson, 7/25/2002

 

 

Status of releases:

Recent progress:

Major outstanding problems:

Ongoing projects:

 

 

DataBases

Ruth Pordes, 7/29/2002

 

 

Applications

=========

Streams:

Will be reported in the analysis tools report. Jeremy has been working more or less full time on this application.

Releases:

When the development database was down, it was realized once again that the releases application is still not in production. Responsibility has been rotated to Jeremy, who is now working with Harry to try to resolve this.

 

TriggerDB:

A significant amount of trigger programming was collected and entered into the trigger database to enable new functionality in the global trigger list (used for all physics data taking at D0) and in trigger lists for special runs. The number of trigger lists in the database is now 92. New functionality fully enabled in this reporting period includes Level 3 tracking and vertexing, Level 3 muon filtering, and Level 2 EM, JET and MUON filtering, with an additional filter requiring multiple objects minimally separated by a programmable distance in eta and phi. A new generic program has been developed so that commonly extracted information can be easily accessed from the trigger database. All these changes need to be cut into production. Terry Wyatt has been trained to enter trigger programming into the trigger database. He has been authorized to enter data into the production instance as well as the development instance (for practice). Mark V. Kane has joined the effort to develop some online tools (in python) involving trigger and luminosity information.

The task list is at http://www-d0.fnal.gov/~gallas/d0_private/trig/triggerdb_todo.html. A few tasks on the list are complete or nearly complete. Many tasks are underway, addressing the most urgent first, but many others will be delayed indefinitely.

 

LuminosityDB:

Based on initial usage patterns and needs, this application will be revisited and redesigned to use the online acquisition system to do the calculations and fill an offline-only database. It will not store any of the raw data. It will be able to use the same interfaces and tools that exist now.

 

RunsQualityDB:

is in use by the experiment.

 

Database Infrastructure and Administration

================================

Database applications next generation (DAN):

Testing and progress have been steady, but have been severely impacted by vacation (Steve) and by a lock-up bug in the server that has taken several FTE weeks of effort without complete resolution. The task list is being maintained, and full testing on the farms and integration is scheduled for September. Margherita has come up to speed, and is on vacation most of August.

 

Database servers:

It was discovered that the deployed production version of the calibration database server did not match the development version. Once the latest version was installed in production there were no further crashes. A mini review of the runs configuration database server resulted in the recommendation that the connection to the database server be closed by the client after every read. The hope is that this will increase the stability of the server. The DBAs are maintaining a weekly status of the number of server crashes on d0dbsrv1 (linux) and their causes, in order to analyse the causes and reduce the number of crashes. http://d0db.fnal.gov/d0dbsrv/d0dbstatus/d0_db_weekly_jul1202.html

Ongoing discussions of the support model for d0dbsrv1 have been taking place between ODS, D0A and the experiment. A development system is being delivered by D0A. d0dbsrv1 was taken down without notice, which successfully exercised the failover scenario. This is all still an open issue.
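The recommendation that the client close its connection after every read can be illustrated with a minimal sketch. This is not the D0 database server client: Python's built-in sqlite3 module stands in for the real server, and the table name `run_config`, the helper `make_demo_db` and the function `read_once` are hypothetical names chosen for the example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def read_once(db_path, query, params=()):
    """Open a connection, perform a single read, and close the connection
    before returning, so no connection is held between reads."""
    with closing(sqlite3.connect(db_path)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query, params)
            return cur.fetchall()


def make_demo_db():
    """Create a small throwaway database file for the example (hypothetical schema)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    with closing(sqlite3.connect(path)) as conn:
        conn.execute("CREATE TABLE run_config (run INTEGER, name TEXT)")
        conn.execute("INSERT INTO run_config VALUES (1, 'demo')")
        conn.commit()
    return path


if __name__ == "__main__":
    path = make_demo_db()
    print(read_once(path, "SELECT name FROM run_config WHERE run = ?", (1,)))
    os.remove(path)
```

The trade-off is a reconnect cost on every read in exchange for the server never holding idle client connections.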

 

Database Instances:

A mail from the DBAs on July 12th: The 3rd partition filled last night, and we have rolled into the 4th. Sam now has stored more than 150 million events! The 3rd partition started filling on 25-March-2002, so it took 3 1/2 months to fill it.

SQL> select partition_number,number_of_events from event_partitions;

PARTITION_NUMBER NUMBER_OF_EVENTS
---------------- ----------------
               1         50005832
               2         50113669
               3         50097909
               4          1221610
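As a quick cross-check of the figures in the mail, the sketch below sums the partition counts and estimates the average fill rate of the 3rd partition. The fill date of 11 July 2002 is an assumption, taken from "last night" relative to the July 12th mail.

```python
from datetime import date

# Event counts per partition, copied from the query output above.
partitions = {1: 50005832, 2: 50113669, 3: 50097909, 4: 1221610}

total_events = sum(partitions.values())

# The 3rd partition started filling on 25-March-2002.
start = date(2002, 3, 25)
# Assumption: "last night" in the July 12th mail means 11 July 2002.
filled = date(2002, 7, 11)

days_to_fill = (filled - start).days
events_per_day = partitions[3] / days_to_fill

if __name__ == "__main__":
    print("total events stored:", total_events)
    print("days to fill partition 3:", days_to_fill)
    print("average events per day: %.0f" % events_per_day)
```

The total is consistent with the "more than 150 million events" quoted in the mail, and the fill time with "3 1/2 months".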

Patches were applied to Oracle Apache in response to a security alert. The load on the databases continues to be monitored. A new version of dbatools with additional monitoring tools was developed and released.

 

 

Analysis Tools

Wyatt Merritt, 7/29/2002

 

 

 

Simulation

Serban Protopopescu, 7/24/2002

 

p11 status

D0reco_x: p11.09 does not apply NL corrections to MC data; p11.10 does. Long saga. With p11 on the remote farms, we need to start the machinery for merging with zero bias data.

 

p12 status:

 

Plans for p13:

Main goal is to make it possible to merge MC and zero bias data.

 

Central Farm

Heidi Schellman, 7/29/2002

 

Running p11.09.00 on recent data and reprocessing the special stream backwards from June 3rd. Processed 6.5M raw events with p11.09.00 and reprocessed 1.3M special stream events as well. The zero suppression change has raised the size of the output and slowed the code from < 20 sec/event with p10.15 on old data to 50 sec/event on a 500 MHz processor; we are falling behind. p11.10.00 available on the 29th; will test and switch to running it.
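To make the "falling behind" statement concrete, the following sketch converts the quoted per-event times into events per CPU per day and into the number of CPUs needed to keep up with a given input rate. The input rate of 1,000,000 events per day is a hypothetical figure used only for illustration; the report does not state the actual data-taking rate or farm size.

```python
import math

SECONDS_PER_DAY = 86400


def events_per_cpu_day(sec_per_event):
    """Events one CPU can reconstruct in a day at the given time per event."""
    return SECONDS_PER_DAY / sec_per_event


def cpus_needed(events_per_day, sec_per_event):
    """Smallest whole number of fully used CPUs that keeps up with the input rate."""
    return math.ceil(events_per_day * sec_per_event / SECONDS_PER_DAY)


if __name__ == "__main__":
    # 20 sec/event is the upper bound quoted for p10.15; 50 sec/event is p11.09.00.
    for label, t in (("p10.15 (upper bound)", 20.0), ("p11.09.00", 50.0)):
        print(label, events_per_cpu_day(t), "events per CPU per day")
    # Hypothetical input rate, not taken from the report.
    print(cpus_needed(1_000_000, 50.0), "CPUs for 1M events/day at 50 sec/event")
```

Since the old figure is quoted as an upper bound, the slowdown factor of 2.5 that these numbers give is itself a lower bound.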

 

Configuration/operation issues.

The SAM crew made major improvements to the station; delivery of files to worker jobs is now very efficient and fast. This has raised our operating efficiency substantially: we got sustained CPU use of 80% over one week after the new station was installed. There are still some residual problems with files ending up in a strange state where they are not removed when unused. We are trying to find what causes this. It requires a cleanup of the cache and a station restart once a week. This is puzzling, as the cleanup algorithms seem to work for most cases.

We are getting lots of failures due to CORBA errors. We are not fully protected against this, partially because not all CORBA errors are passed to SAM cleanly. Maciej and the SAM team are working on it. We have rewritten scripts to avoid db access, or to wrap it in system calls so that errors are caught. This is not a long-term solution: either the CORBA error rate has to go down or we need to catch all conceivable errors.

 

CORBA errors seem to be load related. Lots happen during the Oracle machine backup, when the machine is heavily loaded; they are also correlated with very slow query times to the db, and with job startup, when the farms are making a very large number of db requests. The worst part is that we do the evening job submissions during the Oracle backup.
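The approach of wrapping db access in a system call so that errors are caught can be sketched as below. This is an illustration only, not the farm scripts themselves: the function name `run_with_retry` and its parameters are hypothetical, and the retry with increasing delay is an added assumption motivated by the errors being load related.

```python
import subprocess
import sys
import time


def run_with_retry(cmd, attempts=3, delay=1.0, backoff=2.0):
    """Run cmd as a child process and treat a non-zero exit status as a
    failure. Retry up to `attempts` times, waiting longer after each
    failure. Returns (succeeded, number_of_attempts_made)."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True, attempt
        if attempt < attempts:
            time.sleep(delay)
            delay *= backoff
    return False, attempts


if __name__ == "__main__":
    # Stand-in commands: a child Python process that succeeds or fails.
    print(run_with_retry([sys.executable, "-c", "pass"]))
    print(run_with_retry([sys.executable, "-c", "raise SystemExit(1)"],
                         attempts=2, delay=0.1))
```

A wrapper like this only catches failures that surface as a non-zero exit status, which is why it is not a complete protection.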

 

 

Addendum by Mike Diesburg, 7/31/2002

 

 

Trigger Simulation

Dugan O'Neil, 7/31/2002

 

The current version of D0TrigSim is p11.10. The certification sets for this version are still running. See http://www-d0.fnal.gov/computing/trigsim/cert/trigsimcert.html which contains standard sets of plots for each production release as well as the macros used to create the plots.

 

The most serious concern in p11 is a memory overwrite problem which appears to corrupt L1Cal outputs when L3 is run. This is still being investigated, but is very hard to find. New tools (valgrind memory checker) are being employed.

 

p12.01.00 is still being built and tested. It contains many improvements for the handling of new L1/L2 cal data and L1FT data not present in p11.

 

 

 

Remote Farms

Iain Bertram, 7/31/2002

 

July will be the last month running mcp10.

20.5M reco events in sam from phase mcp10 and reco certification samples.
(See http://www-d0.fnal.gov/computing/mcprod/Stats_2002_06.htm for details)

Preparing mcp11, which will consist of p11.10 executables running on the RH7.1 operating system in most cases. UTA, Nikhef have all been down for a significant amount of time due to farm upgrades. Hopefully better service will resume in the next month. To offset this we have had significant contributions

Lancaster has made several user errors that have reduced its output significantly, and has also been working on developing the new Metadata system.

The Request System and new Metadata will be used with mcp11. Documentation by 2 August 2002.

 

 

 

 

Graphics

Laurent Duflot, 7/31/2002

 

 

 

SAM Status

Lee Lueking, 7/31/2002

This has been a soul-searching month for SAM, with the reorganization discussions, many new people coming to the group, and several of the anchors gone or on vacation. Nevertheless, we have accomplished many tasks, including CRC file transfer verification and a new station monitoring SAM-at-a-glance page. Many cleanups have been done and the current version seems to be working pretty well on the farms.

 

D0mino

D0mino has seen continued record-setting file delivery in July, as seen in the chart below. We have seen a couple of near meltdowns on Friday afternoons when users attempted to submit many projects for the weekend, which caused the system to go into quasi-locked-up states. The solution has been to constrain users to run only as the “dzero” group, which eases the cache management problems observed when users submit as other groups with smaller caches. We have also deployed a new station, v4_2_0_7, that has higher performance and that we hope solves some of the logjam problems. We are now discussing changing the way projects are submitted, and contemplating going back to a more straightforward mechanism that will give much of the scheduling control back to the batch system. This would allow the batch system to limit the number of active projects for each user, and would remove the problems we have had with running into the project limit for the groups.
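The idea of letting the scheduler cap the number of active projects per user can be sketched as follows. This is a toy model, not the D0mino batch system or the SAM station: the class `PerUserLimiter`, its methods and the user and project names are all hypothetical.

```python
from collections import defaultdict, deque


class PerUserLimiter:
    """Toy scheduler that caps the number of active projects per user and
    queues anything over the cap."""

    def __init__(self, max_active_per_user):
        self.max_active = max_active_per_user
        self.active = defaultdict(int)   # user -> number of running projects
        self.waiting = deque()           # queued (user, project) pairs, oldest first

    def submit(self, user, project):
        """Start the project if the user is under the cap, else queue it."""
        if self.active[user] < self.max_active:
            self.active[user] += 1
            return "running"
        self.waiting.append((user, project))
        return "queued"

    def finish(self, user):
        """Mark one of the user's projects as done, then start the oldest
        queued project whose owner is under the cap. Returns the started
        (user, project) pair, or None if nothing could be started."""
        self.active[user] -= 1
        for i, (queued_user, queued_project) in enumerate(self.waiting):
            if self.active[queued_user] < self.max_active:
                del self.waiting[i]
                self.active[queued_user] += 1
                return queued_user, queued_project
        return None


if __name__ == "__main__":
    limiter = PerUserLimiter(max_active_per_user=2)
    for name in ("p1", "p2", "p3"):
        print("alice", name, limiter.submit("alice", name))
    print("bob p1", limiter.submit("bob", "p1"))
    print("started after finish:", limiter.finish("alice"))
```

With the cap enforced per user, one user submitting many projects for the weekend fills only that user's own queue rather than a shared group limit.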

ClueD0

When Chris left, he had almost completed his testing on ClueD0, and left a fairly complete description of what he had finished. The system was, however, left in a development state, and needs to be cleaned up for production. This will take a little time, and then it can be tested by beta Dzero users. There is every indication that things are working, and if the beta tests are successful, we can start allowing a wider user audience.

CAB

Heidi and Marco have been running additional tests on CAB this week. After some hiccups, it seems to be working ok. They have made some recommendations for minor changes before we let others run there. These are not difficult and additional users can start soon. We do need someone to admin this station and field user questions.

Remote Stations

We are seeing quite active usage of the remote stations at several sites. John Weigand is working on extracting transfer information from our logs and presenting it for monitoring. The file transfers to the most active sites from enstore are shown below. Note the list of active stations getting data via their stagers running on d0mino.

Karlsruhe has started a program of pulling over all of the TMB data produced. They run a project each day and pull all of the files. After upgrading to the latest station software, they have been running very smoothly. Christian Schmitt (Wuppertal) has done a lot of this work with help from Andrew Baranovski. 

 

SAM-Grid

Work continues to flesh out the software needed to complete the proposed SAM-Grid architecture. Progress has been slow while Gabriele was devoted almost full time to CDF. Work is ongoing with the Condor team to enable the Match Making Service (MMS) they are providing and to integrate it into the system. This work has been delayed by the Condor collaborators' delays with the Condor development release, as well as by contributions to other D0 and CDF SAM-related tasks. Towards the end of June, we received the Condor release implementing the two changes in the classAds and in the Match-Making service. We have started testing the changes, employing a graduate student and a fraction of an undergraduate student, resulting in a successful demonstration: the gatekeeper for a job was selected by the MMS using the information from an externally provided script. There are a number of outstanding technical issues which we need resolved by the Condor team.

Looking at the larger picture, our proposed SAM-Grid architecture is composed of three major modules: 1) the Data Handling System (SAM), 2) Job Management, and 3) Monitoring and Information Services. While concentrating on the area of job management, we are also exploring MDS and other needed middleware for the information services. SAM will be used for the Data Handling System with some modifications.
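The selection of a gatekeeper from externally supplied information can be pictured with the toy sketch below. It does not use the Condor ClassAd language or any Condor API; the function `match`, the dictionary layout, the host names and the `cached_fraction` ranking are all hypothetical stand-ins for the idea of "filter on requirements, then rank with an external function".

```python
def match(job, resources, rank):
    """Return the resource that satisfies all of the job's requirements and
    has the highest rank, or None if no resource qualifies.

    `rank` plays the role of the externally provided script: it is any
    callable taking (job, resource) and returning a number."""
    candidates = [
        r for r in resources
        if all(r.get(key) == value for key, value in job["requirements"].items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: rank(job, r))


# Hypothetical resource descriptions; none of these hosts are real.
RESOURCES = [
    {"gatekeeper": "gk-a.example.org", "os": "linux", "cached_fraction": 0.2},
    {"gatekeeper": "gk-b.example.org", "os": "linux", "cached_fraction": 0.9},
    {"gatekeeper": "gk-c.example.org", "os": "irix", "cached_fraction": 1.0},
]


def rank_by_cached_data(job, resource):
    """Example external ranking: prefer sites that already cache more of the data."""
    return resource["cached_fraction"]


if __name__ == "__main__":
    job = {"requirements": {"os": "linux"}}
    best = match(job, RESOURCES, rank_by_cached_data)
    print(best["gatekeeper"] if best else "no match")
```

Keeping the ranking as a separate callable mirrors the demonstration described above, where the information used for the choice came from outside the matchmaker.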

 

Igor and Gabriele presented at ACAT2002 in Russia. Igor’s presentation was a plenary talk covering data handling at D0, and included many details of D0, SAM, and SAM-Grid. Gabriele’s talk concentrated on the SAM-Grid architecture, plans and progress. We are also preparing for an SC2002 display in which we hope to highlight much of the SAM and SAM-Grid work. The display will be Grid oriented, and is being done in cooperation with CMS, BaBar, CDF and D0. Of course, Condor is a major part of our Job Management architecture and they will be represented too.

 

Much of our time over the last 3 months has been spent helping CDF evaluate SAM and begin to integrate it into their data handling system. Gabriele has been diverted almost entirely away from the Grid efforts to the CDF project, and this has postponed many of our anticipated goals. Lee has also spent a large amount of time working with CDF to coordinate the SAM project to include their needs, as well as D0’s.

 

Igor and Gabriele have spent a large amount of time training 4 students who are working with us for the summer. Two of these are graduate students from the University of Texas Arlington who were hired through the D0 collaboration with the HEP department there. These students have been involved with our job management, and monitoring and information logging projects. We hope to keep them on beyond the summer, as they have become very useful to the program.

 

Lee has been working to enable additional monitoring of the network data transfers for the existing D0 system. We are working with the SLAC network monitoring group (Les Cottrell), and using the software tools they have developed for the IEPM project. Our networking department (DCD) has deployed a monitoring node at FNAL. We have begun to deploy clients at D0 processing and analysis sites to better understand the network capability of our growing system. At the same time, we are developing better tools to monitor the actual data flows throughout the deployed base of D0 SAM stations worldwide. This latter work is anticipated to be used to generate displays for one panel in the SC2002 presentation mentioned above, which is being prepared jointly by FNAL and SLAC.

   

 

Level 3

Dan Claes, 8/5/2002

 

The global_CMT-8.00 triggerlist, introducing both L3 global tracking and the track-based primary vertex finder online, ran for the 1st time on Friday 26th July. Terry Wyatt checked the online statistics and reported:

 

l3fanalyze........................................................Jon Hays

The list of tools to be analyzed is now configured by a text file which can be automatically generated from a trigger list. A modification to d0tools makes use of this facility automatic and will be used in running l3fanalyze on the reco farm. A word in the DebugInfo chunk now distinguishes "online" from "offline", and comparisons between the two results are now possible within the same job.

 

L3 Trigger Examine.............................................Elliot Cheu

Elliot Cheu (Arizona) has shown plots of electron, jet and tau multiplicity, and recoed physics results collected by the L3 trigger examine during online data taking. The generation of such plots should become a routine shift responsibility, and documentation is initially being made available for shift captains. This examine potentially generates an enormous number of histograms, so we need some useful way of selecting, summarizing and displaying the results. The current scheme makes separate histograms for each tool (and trigger calling it), but it may be desirable to group related triggers together, perhaps by stream.

 

L3 Monitor...................................................Moacyr Souza

Moacyr is working on accumulating filter rates on the .or. of overlapping triggers.
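The reason the .or. of overlapping triggers needs its own accumulation is that an event firing several triggers must be counted once, so the .or. count is generally smaller than the sum of the individual counts. A minimal sketch, with hypothetical trigger names and a made-up list of events (this is not the L3 monitor code):

```python
def or_pass_count(events, triggers):
    """Count events in which at least one of the given triggers fired.
    Each event is the set of trigger names that fired for it."""
    wanted = set(triggers)
    return sum(1 for fired in events if wanted & set(fired))


def single_pass_count(events, trigger):
    """Count events in which one specific trigger fired."""
    return sum(1 for fired in events if trigger in fired)


# Made-up events for illustration; "A", "B" and "C" are hypothetical triggers.
EVENTS = [{"A"}, {"A", "B"}, {"B"}, set(), {"C"}]

if __name__ == "__main__":
    summed = single_pass_count(EVENTS, "A") + single_pass_count(EVENTS, "B")
    print("sum of individual counts:", summed)
    print("count for A .or. B:", or_pass_count(EVENTS, ["A", "B"]))
```

Dividing either count by the live time gives the corresponding rate; only the .or. count reflects the rate actually passed on.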

 

L3 Thumbnail...........................................Peter Tamburello

Peter has rejoined D0 as a postdoc with Arizona and agreed to take responsibility for implementing L3 physics objects into the thumbnail (for p13).

 

Primary vertex...............................................Chris Barnes

The question of falling efficiency with p_t cut in W->munu events remains unresolved. Chris will compare the performance of the track-based and smt-hit-based primary vertex finders. This study may help determine if an smt-hit-based primary vertex finder is practical at L2.

 

Global Tracker..........................................Daniel Whiteson

Updated rejection values (of the standalone track filter applied to single electron and muon triggers) have been reported (based on 100K events from our special run). Daniel has added a track quality cut to improve the rejection for the single track filters. The proposed cuts would require 10 axial hits (at least two smt hits, plus stereo information). These requirements reduce the number of fake tracks by a factor of two. The following thresholds for standalone track filters provide the estimated rejection values given below: 

 

L1 trigger              p_t cut (GeV)      expected rejection
CEM(1,10), CEM(2,5)     one track > 25     150
mu1ptxwtzxx_fz          one track > 10     45
mu1ptxwtzxx_fz          two tracks > 3     16

 

Daniel will use a picked Z->ee data sample (from Volker Buescher) and a Z->mumu sample (from Gavin Hesketh) to get an estimate of the L3 global tracking efficiency for high p_t isolated tracks in data. A large number of cft unpacker error messages indicate an incorrect cable map and/or corrupted data. We should cross-check whether or not the vertex examine or offline reco are producing the same errors.

 

Electron.................................................Ulla Blumenschein

Statistics based on 350 Mark&Passed CEM(1,10) events show rejections look similar to those obtained back in April (lower than those obtained in the June tracking special run). This is consistent with the claim that the L1CAL energy scale has now been reset to the value it had in April. (June's special run had the L1CAL scale about 10% high.) The overall rejection factor (the .or. of all 5 single electron filters hanging off CEM(1,10)) is ~16. These measurements will be repeated on a larger statistics sample. Using Z->ee data, Ulla compared the efficiency of the shower shape cuts used in level 3 with those made offline. The L3 cuts are 100% efficient for high p_t isolated electrons selected by the current offline cuts. These studies will be repeated using Z->ee events, selected with less stringent shape and isolation cuts, but with matched tracks (to keep the background down).
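To see why the measurement is worth repeating on a larger sample, the sketch below estimates the statistical uncertainty on a rejection factor measured from 350 events. The count of 22 passing events is an assumption chosen to reproduce a rejection of about 16; it is not quoted in the report. The error uses a simple binomial approximation, which is itself an assumption of this sketch.

```python
import math


def rejection_with_error(n_total, n_pass):
    """Return (rejection, error) where rejection = n_total / n_pass and the
    error is propagated from the binomial uncertainty on the pass fraction."""
    if n_pass <= 0 or n_pass > n_total:
        raise ValueError("need 0 < n_pass <= n_total")
    p = n_pass / n_total
    sigma_p = math.sqrt(p * (1.0 - p) / n_total)
    rejection = 1.0 / p
    # The relative error on 1/p equals the relative error on p (first order).
    return rejection, rejection * sigma_p / p


if __name__ == "__main__":
    # 350 events is from the report; 22 passing events is an assumed count.
    r, err = rejection_with_error(350, 22)
    print("rejection = %.1f +/- %.1f" % (r, err))
```

With a sample this small the relative uncertainty is roughly 20%, and it shrinks approximately as one over the square root of the sample size.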