Computing and Software Status

 June, 2002

 

 

 

SAM                            Farm                    Monte Carlo Farm                    Remote Analysis

RECO                         Graphics               Infrastructure                           Level 3

Simulation                   Databases

 

 

 

 

 

 

 

SAM Status Report

Lee Lueking (6/24/2002)

 

1.    SAM is heavily used by the collaboration as shown by the following stats.

Many groups have been storing MC data into the system for several months.

In the last two months Tom Diehl has started storing picked events into SAM using the metadata files that are generated by the SamManager package in conjunction with the d0Framework. Other types of data storage have been requested and are being worked on.

 

A new station called CAB was deployed on D0mino to manage the D0mino back end compute servers. This station currently employs 16 nodes (32 processors) as a distributed cache. It has been extensively tested, first by Chris Jozwiak and now by Marco Verzocchi. We feel we understand all the issues causing problems on ClueD0 and hope to have a new station server machine installed and running this week. We would like to thank Chris for all his work over the last year testing all of our systems, including these and the farms, and helping remote station admins and everyone at D0. Chris will be leaving July 5th to go to graduate school at Berkeley. Good luck Chris!

 

We are now working with CDF to enable them to use SAM, and several of their developers are teaming up with us to further improve the software for both experiments. This effort has already much improved the installation procedures, and we look forward to working with them in the coming years on both SAM Core and SAM-Grid projects.

2.     Shift Report

There are a total of 40 shifters, including SAM core gurus. In May we added GMT+5.5 (India) to our time zone coverage and this is working well. Don Coppage has been the shift coordinator for the last 3 months and this is a great help. We plan to extend the shift coordinator responsibilities and add a second person. We need to redefine some of the shift assignments so the D0 shifters take on more responsibility and the gurus are relied upon less. The shift log tool is useful, but would be more so if the shifters used it more extensively.

3.     Sam station version features,  release notes , etc

The latest version of SAM station is 4.2.0.3.  This has been deployed to the farms, CAB, and ClueD0. It will be installed for Central Analysis  on July 2 during the Monthly down day. Following are the release notes for this version.

/*  v4_2_0_3

 

Bug Fixes:

  1. Fixed station crashing problems with --excess-satisfaction=0

  2. Fixed --route not working with --constrain-delivery.

  3. Fixed problem with the long period for creating intrastation transfer requests after a cache hit.

New features/parameters:

   Outstanding major issues:

  1. Problem with small projects overlapping with big ones started first. The small one won't make any progress until the files are delivered for the big one.

  2. station should not retrieve locations for all files when a project starts, but do it on a need-to-know basis

  3. Farm process timeout issues: endProcess needs to work in the project master, project master should not consider files opened after getting exception from a process, and should try to re-deliver a file to a different node if all retries fail

    Outstanding minor issues:

  1. Possible problem with caching algorithm (not enough space for new files/encp's running out of space). (previous release)

  2. Another possible problem with caching/uncaching algorithm (not uncaching the least-recently-used files). (previous release)

  3. Multi-submission of jobs for distributed sam project. Code needs to be thoroughly tested. (previous release)

  4.  --impatiend-end improvement for the "Chris scenario".

  5. The group disk allocation in the distributed environment needs to be designed/discussed, as well as oversubscribing of the available disk space in a single node environment.

  6. Station tries to deliver a file to a node that is down (farm environment).

  7. Problem with file transfers in error. Those can create situations in which project is not getting good files while bad files are retried many times. File transfers in error should not be retried immediately, but instead a new file transfer request should be created, put back into the queue, and optimizer should be asked for a new authorization (with reduced number of retrials for that location). In the short term, the number of retrials will be made configurable.

  8. Intrastation transfers get behind encp's in the queue, causing slow file delivery.

  9. get_enpc_priority.py script needs to be dealt with... This is actually now part of sam_encp.

  10. Inner workings of LRU need to be looked into.

  11. Project master dump output needs to be fixed.

  12. What the heck is the ProjectJobMonitor intended for? Should inherit from the BaseInterface for diagnostics...

  13. Project master code needs to be revisited for exception handling (e.g., InFiles.cpp).

  14. Robustness against inconsistencies found in the database. Example is a snapshot file that has no locations.

  15. If project snapshot consists of files that cannot be delivered the project stays in the suspended state forever.

  16. Fix enstore output parsing. New class should be introduced to encapsulate all parsing issues for both station and fss.

  17. Dead stager problem in the distributed environment (if stager dies, and batch system keeps sending jobs to the same node) 

    Desired new functionality:

  1. support for the nfs mounted disks

  2. CRC checks for fss/station master
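The behaviour requested in outstanding minor issue 7 above (do not retry a failed transfer immediately, put a new request at the back of the queue with a reduced retry budget) can be illustrated with a minimal Python sketch. This is not station code: `TransferRequest`, `run_queue`, `try_transfer` and the default of 3 attempts are hypothetical names and values chosen for the example, and the request to the optimizer for a new authorization is not modelled.

```python
from collections import deque
from dataclasses import dataclass

# Assumed default; the release notes only say the number will be configurable.
MAX_ATTEMPTS = 3


@dataclass
class TransferRequest:
    """Hypothetical stand-in for a file transfer request."""
    file_name: str
    location: str
    attempts_left: int = MAX_ATTEMPTS


def run_queue(requests, try_transfer):
    """Process requests in FIFO order, sending failures to the back of the queue.

    try_transfer(request) -> bool stands in for the real transfer call.
    Returns (delivered, abandoned) lists of file names.
    """
    queue = deque(requests)
    delivered, abandoned = [], []
    while queue:
        request = queue.popleft()
        if try_transfer(request):
            delivered.append(request.file_name)
        elif request.attempts_left > 1:
            # Do not retry immediately: create a fresh request with a reduced
            # budget and queue it behind everything that is already waiting.
            queue.append(TransferRequest(request.file_name,
                                         request.location,
                                         request.attempts_left - 1))
        else:
            abandoned.append(request.file_name)
    return delivered, abandoned
```

With this ordering a project keeps receiving good files while a bad file waits its turn, instead of the bad file being retried many times in a row.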

 

4.    New features, triumphs and problems

Among the major triumphs for the current release are routing and constrained station delivery working, and functioning together. The routing allows for files to be routed through another station (CA for example) on their way to their final destination. The constrained delivery enables them to go through a specific node at the destination station. This, coupled with the work Sinisa has performed to enable the station to function with a distributed naming service, has enabled us to create a working version of SAM running behind a firewall on a virtual private network. The first example of this configuration is at Nijmegen, on the station of the same name, on the node called doesburg. Several sites have asked for this, and we believe it will become a popular way to configure the system.

 

Over the last several months there have been several reports of corrupted files. We believed that most of these originated from Lancaster and the FNAL farm. The farm files were corrupted by the way the root file was closed when the program crashed after the last event. There was not much we could do about these files, other than mark them as “unavailable”. The other files were believed to have been corrupted in transit from the remote processing centers, and our first suspect was bbFTP. CDF had observed a rash of file corruptions which were believed to have been caused by bbFTP, so we decided to implement a CRC procedure that calculates a checksum for the contents of each file initially and verifies it after each transfer. This effort is underway and should be completed in the next week. Since CDF reported the rash of transfer failures, they have found that these were mostly caused by problems with their RAID system. However, the CRC is still important for catching file corruption in the future.
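The check described above amounts to computing a checksum once at the source and recomputing it on the received copy. A minimal Python sketch follows, assuming CRC-32 as provided by the standard `zlib` module; the report does not say which CRC variant SAM will use, and the function names `file_crc32` and `verify_transfer` are hypothetical.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file's contents, reading it in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running value, so large files
            # never need to be held in memory in one piece.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def verify_transfer(source_crc, destination_path):
    """Compare the CRC recorded before the transfer with the received copy."""
    return file_crc32(destination_path) == source_crc
```

In use, the value from `file_crc32` would be recorded when the file is first stored and `verify_transfer` called after every hop; a mismatch marks the copy as bad regardless of which transfer tool moved it.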

 

We are working toward using GridFTP, mainly so we can say we are using it in SAM. Once we have the crc features implemented we will be comfortable trying this new thing out. There are a few security issues that need to be worked out, but we feel we can have this all done in the next few weeks.

5.     Sam data movement 

There are several projects going on to better understand the actual data flows throughout the system. We are putting together a monitoring system that will enable us to look at statistics for inter and intra station transfers. This will allow us to see the number of files, GBs, and transfer rates throughout the entire system. We are working with the FNAL network group to get statistics from the border router on throughputs to and from over three dozen remote sites that have registered stations.   We are also working with the network group to deploy a set of passive monitoring software from SLAC related to the IEPM (Internet End-to-end Performance Monitoring) project. We hope to deploy this passive monitoring software with each SAM station in the future and look at network performance issues to each site. The network link to the outside will be upgraded to OC-12 in late July, and we anticipate much larger data flows in the coming months.
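The kind of summary the monitoring is meant to provide (number of files, GBs and transfer rate for inter and intra station transfers) can be sketched in Python as below. The record layout `(source, destination, size_bytes, seconds)` and the function name `summarize_transfers` are assumptions made for the example; they do not describe the monitoring system under construction.

```python
from collections import defaultdict


def summarize_transfers(records):
    """Aggregate transfers per (source, destination) station pair.

    Each record is a tuple (source, destination, size_bytes, seconds);
    this layout is assumed for the example.
    """
    totals = defaultdict(lambda: {"files": 0, "bytes": 0, "seconds": 0.0})
    for source, destination, size_bytes, seconds in records:
        entry = totals[(source, destination)]
        entry["files"] += 1
        entry["bytes"] += size_bytes
        entry["seconds"] += seconds

    summary = {}
    for (source, destination), entry in totals.items():
        seconds = entry["seconds"]
        summary[(source, destination)] = {
            # A transfer is intra station when both ends are the same station.
            "kind": "intra" if source == destination else "inter",
            "files": entry["files"],
            "gigabytes": entry["bytes"] / 1e9,
            "mb_per_s": entry["bytes"] / 1e6 / seconds if seconds > 0 else 0.0,
        }
    return summary
```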

6.     SAM-Grid 

Work continues to flesh out the software needed to complete the proposed SAM-Grid architecture. Progress has been slow while Gabriele was devoted almost full time to CDF. Currently we have 3 summer students and will get a fourth one next month. We hope that 3 of these 4 will stay through the end of the year. Work is ongoing with the Condor team to enable the Match Making Service they are providing and integrate it into the system. Igor and Gabriele presented at ACAT2002 last week. Igor’s presentation was a plenary talk covering data handling at D0, and included many details of D0, SAM, and SAM-Grid. Gabriele’s talk concentrated on the SAM-Grid architecture, plans and progress.

 

We have completed a paper for the GRID2002 conference in November, which is part of the SC2002 conference. We are also preparing an SC2002 display in which we hope to highlight much of the SAM and SAM-Grid work. The display will be Grid oriented, and is being done in cooperation with CMS, BaBar, CDF, D0 and other representative groups like Condor.

7.    SBIR has been approved. 

Matt Vranicar’s application for an SBIR grant was approved. This will be used to extend the work he has done on the dimension engine. This software is used in the dataset tools, such as the dataset editor and the dataset CLIs. There are many important features we plan for this work to enable in the coming months.


Farm Status 

Mike Diesburg, 28-Jun-2002

1) 

We are currently running p11.08.00 on the farms to reprocess the previously taken streamed data. 

The current set of SMT pedestal files in p11.08.00 is not correct for data taken after the shutdown. A new set of pedestal files was delivered on June 27th. We will cut a new version to process post shutdown data (p11.08.01?).

Note that a few files taken during the week of June 17th have already been processed with p11.08.00. These should not be used.

2) 

Two new versions of SAM have been installed on the farm in the last 2 weeks. Neither has been successful. Serious problems with file delivery and replication have been experienced with both versions. We may have to revert to an older version if a fix cannot be implemented soon.

3)

We have had some network/NFS related problems on the farm I/O node in the last few weeks. One Gigabit interface was replaced which seemed to improve some aspects of the problems, but unexplained NFS timeouts have persisted. These are not fatal problems at this point, mostly just an annoyance. Some changes in kernel configuration were  installed on d0bbin today (June 27th) to try to improve NFS response. Changes in the mount options on the worker nodes will be put into effect at the next convenient shutdown.

4)

Planning is under way for arrival of the new farm nodes. We expect to  move the new nodes to two 4006 switches instead of connecting them to the existing 6509. The 4006 switches are well matched to the farm requirements and are more cost effective than the 6509s in this situation.

Since more nodes are being purchased than originally planned, more trunk lines need to be run to the rack area on FCC2. It may also be necessary to provide some temporary network connections via small switches or hubs to allow acceptance testing of the farm nodes before the 4006 switches arrive.

The farm order is being processed by the legal and business offices (don't ask me what they have to do with this, I have no idea and I am sure I don't really want to know...). Bill Koncleik has taken over the order from Byron and he says it should go out today (Friday) or Monday, as soon as it comes back from legal inspection. The specified delivery date is 6 weeks from receipt of order.

 

Monte Carlo Farm

Iain Bertram, June 28, 2002


Software: mcp10

19.7M reco events in sam from phase mcp10 and reco certification samples.
(See http://www-d0.fnal.gov/computing/mcprod/Stats_2002_06.htm for details)

The problem with Pythia that caused it to hang in tauola in RH 7.1 builds has been fixed. The p10.15.03 generators work effectively now. The first herwig samples generated on the farms are being stored in SAM. The current release of software on the farms is as follows:


  Generators:     p10.15.03
  Dogstar,D0sim,d0reco,recoanalyze:    p10.15.02
  MagField:       v00-01-00

Current requests are at:
http://www-d0.fnal.gov/computing/mcprod/Requests/Requests.html

We have had some problems with multiple entries of recoanalyze files due to an mc_runjob bug and double processing. This mistake is proving difficult to fix, since we have no tools to list all the parent files of a set of sam files; i.e., we need a tool that checks the uniqueness of files. UTA, Nikhef, and Prague have all been down for a significant amount of time due to farm upgrades. Hopefully better service will resume in the next month. The Request System is undergoing testing and is nearly complete (report).
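If the parent files of each sam file could be listed, the uniqueness check itself would be straightforward. The Python sketch below assumes a mapping from file name to parent file names is already available; how to obtain that mapping from SAM is exactly the missing tool and is not shown. The function name `find_duplicate_outputs` and the file names in the usage note are hypothetical.

```python
from collections import defaultdict


def find_duplicate_outputs(parents_by_file):
    """Group files that were produced from the same set of parent files.

    parents_by_file maps a file name to an iterable of its parent file names.
    Returns a sorted list of sorted file-name groups with more than one member.
    """
    groups = defaultdict(list)
    for file_name, parents in parents_by_file.items():
        key = frozenset(parents)
        if not key:
            # Files with no recorded parents cannot be compared this way.
            continue
        groups[key].append(file_name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

For example, two recoanalyze files that both list the same parent, such as `{"reco_a_1": ["sim_a"], "reco_a_2": ["sim_a"]}`, would be reported together as one group of candidates for double processing.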

 

Infrastructure/Code Management Status:

Alan Jonckheere, June 28, 2002

 

Sections:

Code Management

Releases

Resources

Major Infrastructure products

RCP

EDM

Code Management

Test Releases General: We are routinely doing only debug builds on IRIX and RH7. We are not doing RH6 or maxopt at all due to lack of demand and resources. We maintain the one per week rate.

Production Releases General: We are routinely doing RH6, RH7 and IRIX, debug and maxopt (6 builds) per release. We will cease doing RH6 as of p12 (1 July).

 

"t" releases:

We are currently building t02.24.00 which is to become the basis of the p12 production releases. As in the last report the "t" releases are suffering due to lack of attention. Most of that effort is going into the production releases. It is unclear how useful the "t" releases are for the users. Presumably they are useful since we don't hear a lot of complaints. But then problems can go a couple of weeks before being discovered. Often that happens when we delete the oldest release, forcing people to use a newer one which has a problem. Unfortunately by that time it's impossible to go backward, so there are *no* usable releases until the problem can be fixed.

 

"p" releases:

Since the last report:

p10: We have built p10.15.03, which was a change in the compilation parameters for one external product (tauola). No code was changed, but the monte carlo executables were rebuilt. This is/was the *last* build of p10. We have now (as of 6/24/02) converted the rcp file system database to the new multi directory organization. It is now difficult, though not impossible, to do another build.

p11: p11.07.00 was, in the last report, thought to be the last one in this series. It didn't happen for a number of reasons, none of which I know well. p11.08.00 is currently running on the d0reco farms. p11.09.00 is building. This one will probably never go to the farms, but, depending on how p12 goes, p11.10.00 might.

p12: p12.00.00 will be cut from t02.24.00 in the first week of July.

 

Build Resources:

Build Machines

The major resource problem we are having these days is lack of machines. Builds on d0mino have taken as long as two days (44:20 hours) and almost 24 hours on the other two build machines. This is with at least one other build occurring at the same time. This has gone up from about 8 hours on d0mino about a year ago when we first installed the parallel build system. The growth in time has been fairly slow but steady. It cannot be attributed to an increase in packages or anything like that, but it does pretty much parallel the use of d0mino. However, there should be plenty of resources available on d0mino, so this is not well understood.

The build times on the Linux boxes are pretty much what we expect. D0lomite (RH7.1, 750MHz, 8 processor) is only a little faster than d0lxbld4 (RH6.2, 500MHz, 4 processor) because it's using disks nfs mounted from d02ka. This costs about a factor of two. However, we would like to keep this arrangement because in this configuration the builds are instantly available to the entire Linux world at FNAL, ClueD0 in particular.

The above is exactly the report from Feb. One bottleneck has since been identified on d0mino, but not corrected. All nfs mounts, including those to the /usr/products and /d0usr/products disks, go over the default network interface, and all of its interrupts are handled by a single cpu. When the load is moderate, this is no problem. But all internet traffic from offsite as well as all interactive traffic also goes over that interface. When the MC farms are sending large amounts of data, this totally swamps the cpu handling the interrupts. This effectively kills almost all uses of d0mino. Interactive response goes to nil, as does any activity that requires accessing the products disks. Since the compiler is on that disk, there go our build times.

One other problem that we've run into: we simply cannot do two production releases and a t release in one week. d0lomite, the RH7 build machine, just can't handle it.

 

Disks

On /d0dist/dist/ we have:

d0mino 205GB 3x striped and shadowed main set, 137GB 2x striped secondary set, 36GB for tarfiles. The main/secondary sets keep the two sorts of builds from interfering with each other, and at least one from interfering with the users.

d0lxbld4 100GB RH6.1 served to d0lxbld1 (very few releases there)

d0lomite 262GB on d02ka but still have 215GB locally if we need it. Using the nfs disks slows the builds a lot (30% at least) but it makes the builds available immediately. So far, this is judged to be more important than speed.

d02ka 262GB (same disk as above) RH7.1 builds served to clued0 etc

 

Major Infrastructure products

RCP

We have fully converted to a new version of rcp which supports a multi subdirectory version of the file system database. It also dispenses with storing the release tag names in each rcp file. These changes make the file system db smaller, more efficient and more scalable. We shouldn't have any scale problems until we reach 4 billion files, an event that I don't think I'll see.

 

EDM

Just completed (on time and under "budget") was a fairly complex change to EDM to allow dropping provenances (something like chunk history) for event data chunks that are dropped. This was needed to decrease the size of the thumbnail by about 4kB per event. This has gone into t02.24.00 so it will be in p12. The changes were made in a fully backward compatible manner. The corresponding changes to WriteEvent (io_packages) to use the new interface features have not yet been done. A decent summary of the changes made can be found in my talk at 5/29/02 CPB meeting: http://d0server1.fnal.gov/users/jonckheere/cpb/EDM-status.ppt

RECO Status

Harry Melanson, June 28, 2002

Current certified version: p11.08.00  

Next version: p11.09.00  

  Possible additional updates to p11

  P12 schedule

Remote Analysis

Jae Yu, June 28, 2002

 

Status

  1. Completed and submitted the proposal for DØ Regional Analysis Center to the CPB and the DØ Management.

  2. UTA McFarm control software has been packaged and released along with the relevant instructions.

  3. Three new MC sites (Oklahoma, LTU, and Tata) are being brought up to use McFarm control software from UTA for D0 Grid July milestone of submitting and monitoring MC generation jobs to these farms via Grid tools.

  4. Two additional sites (Rice and KSU) are being set up for D0RACE.

  5. Effort to increase the activities in the sites that are already set up and to monitor performance is underway.

Plans:

  1. We plan to identify a pilot site for DØ Regional Analysis Center and demonstrate the working of the concept as laid out in the proposal. This site does not necessarily have to be a regional analysis center.

  2. We plan to bring up the other 4 new MC farm sites for continued development of D0Grid. The control software will need to adapt to more Grid tools.

  3. We will continue to bring up more sites and activate them.

  4. We will improve performance and usage monitoring of the setup via SAM system.

 

Graphics

Laurent Duflot, June 28, 2002

We had a meeting today; this is the situation for graphics. I don't have a complete picture yet, though.

Level 3

Terry Wyatt, June 27, 2002

 

N.B.

In Progress with p11

Plans for p12

N.B. (*) indicates that the basic infrastructure for a particular item is already in t02.24 or an earlier release. Testing may still be in progress!

Stuff that will have to wait for p13

Text version of the L3 report from Dan Claes:


Online Monitoring.............................Michiel Sanders, Elliot Cheu

Michiel is implementing l3fanalyze as an online ntuple-based monitor. A  root macro will read out (and continuously update) the roottuple produced,  filling histograms defined by the L3 group. Elliot Cheu (Arizona) is  responsible for the macro, and has already shown plots of electron, jet  and tau multiplicity during online running. Tool authors are providing  documentation of their l3fanalyze roottuple contents and suggesting the  more useful quantities to histogram.


l3fanalyze........................................................Jon Hays

The new dynamic l3fanalyze is now ready for public use. It has been  modified to accept an ascii file listing tools and reference parameter  sets. This list can now be generated from an input triggerlist. d0tools/D0TrigSim has been modified to accept either the ascii file or trigger list. Jon is implementing an "online/offline" label, allowing on- and offline versions of the roottuple to be produced when running on real  data. This will facilitate direct comparisons of on- and offline results needed for verification. 


Primary vertex................................................Chris Barnes 

The version available in p11 (averaging the DCA of all tracks above a  specified threshold) is fully tested and ready to be exercised online. (Performance is improved using a pT weighted histogram of the z_0 of all tracks, which will be available in p12.) Purity increases with the increased threshold, and plateaus to ~.95 at 3-4 GeV for most physics processes. For Z->mu mu and W->e nu, however, the efficiency drops from about 75% to less than 50% by varying pt cut from 3 to 10 GeV. This behavior will be investigated. The proposal to run L3 jet and electron filters with the track-based primary vertex in duplicate L3 scripts has been approved.
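The pT weighted histogram method mentioned for p12 can be illustrated with the short Python sketch below. It is a simplified stand-in, not the L3 tool: the z range, the 2 cm bin width, the choice of returning the weighted mean of the peak bin, and the name `primary_vertex_z` are all assumptions made for the example. Only the idea of weighting each track's z_0 by its pT above a threshold comes from the text.

```python
def primary_vertex_z(tracks, pt_min=3.0, z_range=(-100.0, 100.0), bin_width=2.0):
    """Estimate the primary vertex z from a pT-weighted histogram of track z_0.

    tracks is an iterable of (pt_gev, z0_cm) pairs. Returns the pT-weighted
    mean z_0 of the tracks in the highest-weight bin, or None if no track
    passes the pT threshold inside the z range.
    """
    z_low, z_high = z_range
    n_bins = int((z_high - z_low) / bin_width)
    weight = [0.0] * n_bins
    weighted_z = [0.0] * n_bins

    for pt, z0 in tracks:
        if pt < pt_min or not (z_low <= z0 < z_high):
            continue
        index = min(int((z0 - z_low) / bin_width), n_bins - 1)
        weight[index] += pt          # each track contributes its pT
        weighted_z[index] += pt * z0

    best = max(range(n_bins), key=weight.__getitem__)
    if weight[best] == 0.0:
        return None
    return weighted_z[best] / weight[best]
```

Raising `pt_min` removes soft tracks from the histogram, which is the trade-off between purity and efficiency discussed above.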


Electron.................................................Ulla Blumenschein 

Rejection studies of the currently used ORing of tight shape+Emfr>0.9 in [Ptmin,Ptmax] || Emfr>0.9 for Pt>Ptmax, and of the individual rejection provided by Emfr, showed it may not be possible to attain the needed rejection factor of 6 and retain current trigger thresholds. The addition of a stand-alone track filter and/or track + calorimeter filter may be able to maintain efficiency for lower E_T electrons. Work targeted for p12 includes central track match, eta-dependent shower shape cuts (expected t02.26) and a track-based road method.
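One reading of the ORed electron condition quoted above is sketched in Python below: within [Ptmin, Ptmax] a candidate needs both the tight shower shape and Emfr>0.9, while above Ptmax the Emfr cut alone is enough. This interpretation, the function name `passes_electron_filter`, and the treatment of the interval boundaries are assumptions; the actual filter code may differ.

```python
def passes_electron_filter(pt, em_fraction, tight_shape, pt_min, pt_max,
                           emfr_cut=0.9):
    """Return True if a candidate passes either branch of the OR.

    pt_min and pt_max are the Ptmin and Ptmax of the text; their values are
    not given in the report and must be supplied by the caller.
    """
    if em_fraction <= emfr_cut:
        return False              # both branches require Emfr above the cut
    if pt > pt_max:
        return True               # high-Pt branch: EM fraction alone
    if pt_min <= pt <= pt_max:
        return tight_shape        # low-Pt branch: also needs tight shape
    return False                  # below Ptmin: rejected
```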


L3 CPS...........................................Andre Turcot, Chunhui Han

The code will be run on a large sample of recent data to check stability, timing and memory consumption. Need to verify that the correct map for  real data is available. 


L3Muon......................................Martin Wegner, Martijn Mulders

L3 has been providing additional rejection on the four single muon triggers carrying L2 rejection since late May. 

Comparisons of L3 with reco local muons reveal that muons matched well in phi have very poor matching in eta, that local muon momenta agree poorly between L3 and reco, and that comparisons made using muons with matched central tracks seem to suggest L2 has better local muon momentum resolution than either L3 or reco. The differences between online and offline local muon pT may be due to geometry differences. Martijn and Martin are trying to adapt the code to use the EVPACK geometry format stored in the offline database, and are consulting Robert Illingworth on the L3GeometryManager to get the full offline-quality geometry for L3TMuon.

The huge memory consumption by events with large number of hits  (controlled by .rcp mods introduced into p11.05) turned up a bug in  L3TMuon: multiple hits in the same layer can be assigned local x values differing by ~1 mm. A fix will go into p11.08. 

Work verifying the central track match tool is targeted for p12.

 

Global Tracker.........................Daniel Whiteson, Robert Illingworth

The 3-GeV pT cut introduced in the proposed global_CalMuon6.10 proved less efficient (in matching to stereo) than the 0.5-GeV threshold used in earlier studies. This was due to the narrower roads used for higher pT cuts. Code mods by Daniel widening the roads were released in time for a special tracking run. Additionally, Robert updated the cft stereo map and smt geometry files to fix the poor, phi-asymmetric stereo track-finding efficiency.

200k events were collected in a June 1st special Mark&Pass run (requiring a single track p_t>3 GeV) on the single electron and muon triggers, and  turning tracking on in the tau filters. L1 central muon triggers were missing. About 10% of CFT axial layer B was missing. About 1/6 of SMT was missing.

The results of this test run are promising. The filters produce exactly the same results online and off. Requiring a track over 5 GeV reduces the rate of CEM(1,10) by a factor of 2.5 and wide muon by a factor of 10. For the muon case a factor of 30 can be achieved with a pt cut of 7 GeV. MC studies indicate efficiency is high (in fact better than GTR).


CFT-only tracking.......................Ray Bueselinck, Robert Illingworth 

Work proceeds in updating the tool's certification for online running with the latest data, as well as improving speed and efficiency (particularly  in stereo tracking). A quantitative performance comparison will be made  with global tracking.


l3Tau..........................................................Yann Coadou 

Yann is updating code (changing the track extrapolation method, allowing  cut on the # of tracks). This will introduce additional tunable  parameters (under .rcp or trigger list control).

Calorimeter non-linearity and L3NADA(hot cell killing).........Marumi Kado

With the new features (non-linearity and NADA) switched off, the new  calorimeter unpacker provides exactly the same results as with the earlier version. 

L3 missing ET tool..............................................Lee Sawyer 

Comparisons of L3 and offline missing ET tools have been made using Monte Carlo samples: QCD p_t>10, QCD p_t>100 and ttbar. The performance looks reasonable except for L3 scalar ET, which shows a large (~80 GeV) offset. This is probably caused by incorrect calorimeter weights. Work directed  toward p12 includes studying the effects of calorimeter non-linearity and hot cell killing.

An updated posting of current work and detailed near term plans can be accessed at: http://www-d0.fnal.gov/computing/algorithms/level3/status.html

 

 

Simulation Status

Qizhong Li, June 28, 2002

 

Generators:

Herwig is updated to v6.4 in the test release, which will be in the p12 production release. A fix was put into p11 to deal with a fairly rare occurrence in Herwig, where the SOFT interaction produces heavy mesons.

After t02.22.00: New generator executables:

MCpompyt.x, MCpomwig.x,MCscipyt.x: simulation of forward diffraction

MCstdhep.x: convert files in stdhep format to d0om-dspack format.

MCAnalyze_x can now produce either a root tuple or a root tree.

 

D0gstar:

Muon geometry is updated in the test release, which will be in the p12 production release. A test was done putting the GEANT geometry into ROOT format, which should speed up access to the GEANT geometry, so that the D0gstar processing CPU time can be significantly reduced. Unfortunately the currently developed ROOT-geometry package could not handle one of the volume shapes D0 is using. The ROOT group will implement this shape and get back to us.

 

D0Sim:

The pileup package in D0Sim now has the calorimeter non-linearities in. It is in p11.09 production release now. No changes to D0Raw2Sim in June. Need a muon person to fix the crashes in muon code for D0Raw2Sim.

 

PMCS:

A lot of recent effort has been spent debugging the code which generates D0 physics object chunks inside PMCS, and on global tests of the code. Actually, a lot of effort has been spent debugging the reco_analyze packages. There is progress on comparisons of PMCS results with RECO results for electrons and jets. Comparisons for other objects could not be done because of muo_analyze and met_analyze problems. The PMCS group needs to have one representative from each particle ID group in order to improve communication between the PMCS and ID groups, and to be able to tune the parameters for each physics object.

 

D0TrigSim: (Dugan)

The current version of D0TrigSim is p11.08. This version has been run on hundreds of thousands of MC events with very few failures spotted (one failure in l3muon analyze). A new trigsim certification page now exists at http://www-d0.fnal.gov/computing/trigsim/cert/trigsimcert.html which contains standard sets of plots for each production release as well as the macros used to create the plots. The most serious problem with d0trigsim stability has been the rapidly changing L1/L2 calorimeter trigger. Pedestals, gains and eta coverage have all radically changed in the last month. New data taken since these hardware changes should be dealt-with by default in p12.00.00 (next week). To read new data in p11 requires at least p11.09.00 and some rcp flags must be changed. In L3 the most persistent problem has been filling the analyze branches corresponding to the running trigger list. The list of tools which would appear in the ntuple needed to be set by hand and changed for each trigger list. This caused a lot of confusion among users. A solution has been implemented and tested and is available in p11.09.00 and in the test release. Documentation will soon be sent to users.

 

 

Databases:

a) Offline Database Administration.

The offline databases continue to be monitored, and support is given for table space design and implementation for new applications. Small changes were made for SAM, Trigger Database, Runs Quality; Ursula and Eric were helped with their table space design and implementation. DBA support for the SAM databases - D0 and CDF - will be reported under the SAM project.

b) CPS.

A focused effort was made in the last week of June to bring the CPS application to production and have it used by Reco. (I don't quite know if it made it - Eric?)

c) Database Servers.

There has been some continued instability in database servers hosted on the Linux box. A weekly summary of database server crashes will be posted. The first one is at http://d0db.fnal.gov/d0dbsrv/d0dbstatus/d0_db_weekly_jun28.html A code review of the Runs Configuration Database application was held - Jeremy, Herb, Taka and Steve as reviewers. A proposal to close the client connection after each query was made, and for testing of the application using the new database server infrastructure. The database servers are being moved to be hosted by D0mino to see if the rate of crashes diminishes.

d) Trigger Database:

Significant improvements have been made to the trigger db server with the help of John Weigand. The changes are documented in Release Notes available at http://www-d0.fnal.gov/d0dist/dist/packages/trigger_db_server/devel/doc/ReleaseNotes_020611.txt Work is starting on the changes in the Trigger Database and Application needed to support Streaming. Additional documentation has been added to the Entry Interface help pages as well as a new document that explains the rules of the trigger database and the basis for those rules: http://www-d0.fnal.gov/~gallas/d0_private/trig/TDB_Rules.html An additional document is in development on the relationship between the Trigger Database, the Luminosity Database and the Run Summary Database: http://www-d0.fnal.gov/~gallas/d0_private/trig/Run_Summary.html

e) Run Quality Database:

The Runs Quality database application is in production. Stefan Soldner is learning the infrastructure in order to provide ongoing improvements and maintenance. It is understood some new sam dimensions will be needed to support the use of the information.

f) Database Application NextGeneration (DAN)

Much progress was made on the first release. An MS project plan was made for converting the database applications to the new infrastructure, testing and deployment. This is reviewed weekly. Margherita Vittone, Jeremy and Taka are converting and testing calibration, luminosity and other database applications. Steve has converted the SAM application, which does not use the caching features, and this is now in production. Jim Kowalkowski continues to give technical design guidance and oversight. He will continue this for the implementation of the disk cache and proxy services, while the deployment and testing will fall more on Steve, Margherita and Taka. Steve will be on vacation for 2 weeks in July which will slow progress somewhat. There is an immediate need to have a development Linux machine for testing of new versions of the database server in development. Currently there is only a production machine.

g) The work on streaming is under the Analysis Tools group and will not be reported here.

 

h) Calibration database applications and browsers. I would recommend a follow-up to the workshop on database browsers and applications sometime in the next few months.