Computing and Software Status

August 2001

  • Algorithms

  • RECO
    L3

  • FNAL FARMs
  • Graphics
  • Infrastructure
  • Online
  • Remote Production
  • SAM
  • Databases
  • Simulation


  • Algorithms

    RECO

         Harry Melanson
    Available versions of RECO Schedule: Current status:

    o Real data processing
    t154 is running on the farm. It uses the correct CFT geometry, and has improved real data track finding paths. Fixes for memory leaks in CellNN are included. Also, SMT software deals with abnormal occupancies by truncating cluster lists. SMT sparse mode readout is not supported until t155. Unfortunately, t155 is unable to process real data, due to a last-minute error in calorimeter code (fixed in t156).

    o Monte Carlo processing p08.12.00 is being installed on the remote farms. It suffers from two large memory leaks in reco_analyze. These have been fixed, and p08.13.00 will be built as soon as disk space / release manager resources are available.

    o Upcoming production release p10.00.00 will be based on t01.56.00. After the first build of t156, the d0reco executable existed and passed its MC integrated tests. The real data test failed due to an RCP error (under investigation). Tests of t155 give (d0mino, maxopt, Z(mm) + 2.5) CPU time: 85 sec/event, mem usage: 654 Mb (1K events). The CPU time is a little longer than expected (may be ok), and the memory usage indicates probable memory leaks. Memory leaks are under study; however, purify crashes (experts informed). Performance on Linux to be measured. New reco certification samples using p09.08.00 are being generated for p10 certification (12K events exist so far).

    Other outstanding issues:

    1. There is a report from the muon group of no high pt tracks when processing p09 reco certification samples with t155 (under investigation).
    2. Code is ready to access offline data base to get run time configuration (magnet polarities). Needs to be enabled.
    3. The calibration group has demonstrated the ability to read all SMT constants from the database. Current performance: requires 100Mb and ~ 30 minutes for all constants. This is deemed acceptable for version 1, although memory requirements should be reducible. Work will continue. The group is currently working on going to production, and expects to enable calibration db access on the farms in early September.
    Summary:
    p10 will be the first RECO production release that will support real data processing, and will be used to process data for the Commissioning Workshop (August 31, 2001). It will also be the first version to access the calibration database, produce a Thumbnail chunk (incomplete), and officially support processing of MC samples generated within the same release. It appears to be on schedule. There will be performance issues (CPU time and memory usage) that will have to be addressed. This will be the main focus for the p11 production release.

    L3

    July 2001 Level3 Monthly Report
    ===============================
    General
    Status.............................................................
    Summer plans continue focusing data collection efforts on p10 (with nt-p10
    rebuilds done only to fix bugs causing drastic performance problems) and
    expanding the online monitoring & offline verification efforts.

    Manpower Changes...................................................
    With her new appointment as co-leader of the D0 Computing and Software
    Project, Amber Boehnlein steps down from her role in Level3.  Moacyr
    Souza, (LAFEX), the architect of ScriptRunner, steps in as L3 Filtering
    co-leader.

    Imperial student Ian Blackler begins work on a filter for messages sent to
    the Significant Event System, in hopes of controlling the number forwarded
    to the alarm server.

    Daniel Bloch and Francois Charles (Strasbourg) are studying a 3D vertex
    finder based on similar work Francois has done for CMS.

    Online Testing.............................................Jonathan Hayes
    The following are already on ONLINE nodes and have been or will be tested:
                    Calorimeter: unpacking, clustering, jet finding
                    SMT        : unpacking, SMT-only tracking
                    Muon       : unpacking, coincidence filter
                    CFT        : unpacking, tracking and global tracking

    Timeouts in initializing (and backing up when SESystem queue fills) slowed
    early summer work.  May need to stop blocking ScriptRunner whenever
    SEServer is down.

    4 basic triggerlists exercise the current online needs of the exe:
    SMT tracking, MuoLocal, Muon coincidence, and Jet reconstruction.

    Initial tests on Author node have shaken down SMT code on data samples
    that include p05 QCD, and 1x8 store collider data.  All 4 tools must be
    tested (looking for reasonable output), possibly with some profiling and
    RCP-tuning.

    Transfer to Linux needed for l3fanalyze runs.
    Online profiling (with an optimised build) of muon unpacking, calorimeter
    unpacking, muon coincidence filtering, calorimeter clustering, and the
    jet tool should follow.

    Tool Reports
    ------------

    Jet........................................................Volker Beuscher
    Little/no feedback from physics/Id groups leaves development still
    directed internally with L3.  This tool is running on data but still needs
    its large statistics shakedown. EM functionality is being added to the
    Jet tool for short term data collection plans.  The earlier (p06) studies
    by Andre and Sailesh should be revisited.

    SMT/CFT unpacking/clustering............................Robert Illingworth
    A DB-script now extracts the SMT cable map from the database.
    Timing results from tsim_l3 runs (on Z->bb+2.5 min bias reco certification
    sample) give 21msec(sigma=7.8) for SMT unpacking and clustering (d0mino).
    For comparison: 26msec(1.6) for the (non-zero-suppressed) 36x36 data.
    Time scales beautifully with the number of clusters, as expected.
    The CFT timing distributions showed a tail (w/overflow) with a 23.12msec
    mean. However, the times did not cleanly scale with clusters, which is to
    be investigated.

    CFT Tracking................................................Ray Beuselink
    Regional tracking is not yet functional, and will require fairly extensive
    reorganization of the way lists of clusters and links are built/stored.
    Ray is studying premature hit assignments (to candidate tracks) made in
    complex events to improve the reliability of axial tracking. Modifications
    are still being tested and debugged (and not yet in CVS).

    Global/SMT Tracking.......................................Daniel Whiteson
    L3 global and SMT only tracking is running (offline) on real data. Daniel
    has shown the first reconstructed global track (run 119679,evt=231507).

    SMT-only histograms (>~10000 events) are being used to study the hits per
    track (~3900 with 4 hits, ~400 with 5 hits, a few w/ 6-7 hits),
    chi-squared fit, distance of closest approach, phi-0, Q/pt, Z0,
    tan(lambda).

    Tracking Efficiency Standards............................................
    While the efficiencies and rejection determined by fully simulated physics
    triggers will have the greatest utility to users (and triggermeisters), we
    also need to document the individual performance of complementary tracking
    tools (to understand the rates and select the tool appropriate for
    specific filters).  Daniel and Ray have been converging on common
    definitions for reporting purity and efficiency, and Daniel has written a
    set of tools to calculate these measurements.
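
    As an illustration only, the kind of calculation involved can be
    sketched as follows. The report does not state the definitions Daniel
    and Ray agreed on, so this sketch assumes the common ones: efficiency is
    the fraction of findable true tracks matched by at least one
    reconstructed track, and purity is the fraction of reconstructed tracks
    matched to a true track. The function name and the input layout are
    hypothetical and are not taken from Daniel's tools.

```python
def efficiency_and_purity(true_ids, reco_matches):
    """Return (efficiency, purity) under the assumed textbook definitions.

    true_ids     : IDs of the findable true tracks (hypothetical input).
    reco_matches : one entry per reconstructed track, holding the ID of the
                   true track it was matched to, or None for a fake track.
    """
    true_set = set(true_ids)

    # True tracks found at least once (duplicates count only once here).
    matched_true = {m for m in reco_matches if m in true_set}

    # Reconstructed tracks that point back to a real true track.
    good_reco = sum(1 for m in reco_matches if m in true_set)

    efficiency = len(matched_true) / len(true_set) if true_set else 0.0
    purity = good_reco / len(reco_matches) if reco_matches else 0.0
    return efficiency, purity


if __name__ == "__main__":
    # Toy input: 4 true tracks, 5 reconstructed tracks (one duplicate of
    # track 2, one unmatched fake, one matched to an ID outside the sample).
    eff, pur = efficiency_and_purity([1, 2, 3, 4], [1, 2, 2, None, 9])
    print(f"efficiency = {eff:.2f}, purity = {pur:.2f}")
```

    Whether duplicate matches should count toward purity is a choice that a
    common definition has to settle; here they do, which is an assumption.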

    Primary Vertex..............................................Ray Beuselink
    L3TfastZVertex uses L1 luminosity bits (providing a 6.25cm binning) and
    runs on collider data (but not MC, where the required bits are unavailable).
    L3fcft_vertex is in cvs (though unreleased) and provides z-resolutions on
    the order of a centimeter.  In principle it could be run on SMT or global
    tracks as well.

    SecondaryVertex..........................................Arnaud Duperrin
    The current algorithm finds secondary vertices in 2% of QCD events, but
    55% of Z->bb or top (at least one vertex found with a decay length at
    least 1 mm, minimum of 3 tracks with at least 4 hits/track) (~100 ms).
    These rough estimates do not yet include any L1/L2 cuts.
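
    The selection quoted above can be written down as a small sketch. This
    is not the L3 tool's code: the Vertex class and function names are
    hypothetical, and the reading of "minimum of 3 tracks with at least 4
    hits/track" as "at least 3 tracks that each have 4 or more hits" is an
    assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vertex:
    """Hypothetical container for one secondary vertex candidate."""
    decay_length_mm: float
    hits_per_track: List[int]  # number of hits on each attached track


def passes_secondary_vertex(vertices, min_decay_mm=1.0, min_tracks=3, min_hits=4):
    """True if at least one vertex satisfies the quoted cuts."""
    for v in vertices:
        good_tracks = sum(1 for hits in v.hits_per_track if hits >= min_hits)
        if v.decay_length_mm >= min_decay_mm and good_tracks >= min_tracks:
            return True
    return False


def selection_fraction(events, **cuts):
    """Fraction of events (each a list of Vertex) passing the selection."""
    if not events:
        return 0.0
    return sum(passes_secondary_vertex(ev, **cuts) for ev in events) / len(events)


if __name__ == "__main__":
    toy_events = [
        [Vertex(1.5, [4, 5, 4])],   # passes
        [Vertex(0.5, [4, 4, 4])],   # decay length too short
        [Vertex(2.0, [4, 4, 3])],   # only two tracks with >= 4 hits
        [],                         # no vertex found
    ]
    print(selection_fraction(toy_events))
```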

    The tool now connects to CFT/SMT and Global tracker reconstruction by
    obtaining a pointer to an L3TTracker, so one can now perform 3D tracking
    using the official L3 tracking. SMT clustering/hit finding is also
    implemented to compute reconstruction efficiencies.

    Daniel Bloch and Francois Charles (Strasbourg) are studying fast 3D
    vertexing, but this work should really start sometime in September.

    CPS unpacking..................................................Chunhui Han
    Unpacker runs on 1x8 RawDataChunk. Channel mapping/array sorting is fine
    - no difference observed w/results of cps_examine (in by-hand comparisons,
    which Chunhui plans to automate in order to check all channels).

    CPS...........................................................Andre Turcot
    Tool has seen several thousands of events now, without major problems.
    The schedule of AFE arrival/implementation suggests that an axial-only
    mode (providing adequate resolution by itself) be implemented.

    FPS...........................................................Andre Turcot
    No unpacking yet available. Tool development awaits map and final
    hardware specs.

    Muon Hit Coincidence (MuoHitCoinc)........................Eduardo Gregores
    Has been updated to take parameters defining the layers and octants in
    which to check for coincidences.

    Muon............................................................Paul Balm
    Unpacking runs regularly online.  Its speedup has not been a priority.
    In L3TmuoLocal, however, possible speedups were targeted in both
    segment-finding and track reco.  Central track-matching (timing in at
    4msec) has been released.  L3TmuoCalTrack (utility for the MTC package)
    properly identifies the expected calorimetry cells.

    Two new problems have just been uncovered relevant to the muon tool
    (both from underlying packages, not L3 code)
    1) Memory leak in the track reconstruction code (~100 MB after 100k events)
    is being looked into by the Saclay group.
    2) Mis-identified electronic channels assigned to non-existing detector
    elements, e.g. phi-section #7 when only 5 phi partitions are present.
    2 occurrences found; being investigated by Dave Hedin and Mike Fortner.
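
    A consistency check of the kind that would flag the second problem can
    be sketched as below. The channel-map layout (a dictionary from
    electronics channel to phi section) and the 1-based numbering of phi
    sections are assumptions made for illustration; this is not the muon
    unpacking code.

```python
def find_invalid_channels(channel_map, n_phi_partitions):
    """Return the channels mapped to a phi section that does not exist.

    channel_map      : dict {electronics channel: phi section}, hypothetical.
    n_phi_partitions : number of phi partitions actually present.
    Phi sections are assumed to be numbered 1..n_phi_partitions.
    """
    return sorted(
        channel
        for channel, phi_section in channel_map.items()
        if not 1 <= phi_section <= n_phi_partitions
    )


if __name__ == "__main__":
    # Toy map: channel 103 points at phi section 7 although only 5 exist.
    toy_map = {101: 1, 102: 5, 103: 7, 104: 0}
    print(find_invalid_channels(toy_map, n_phi_partitions=5))
```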

    L3Tau..................................................Gustaaf Brooijmans
    New algorithms introduce a track-based (rather than the previous
    calorimeter-based) search.  Track clustering is Pt ordered, and single
    isolated tracks above a minimum Pt are selected.  The code has been
    debugged and released in p10, and timing and memory use are being
    investigated.
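
    A rough sketch of a Pt-ordered track clustering with a single isolated
    track selection is given below. It is not the L3Tau code: the cone
    clustering in (eta, phi), the cone size of 0.3 and the 5 GeV threshold
    are all assumptions chosen for illustration, since the report does not
    give the clustering metric or the cut values.

```python
import math


def delta_r(a, b):
    """Separation in (eta, phi), with the phi difference wrapped to [-pi, pi)."""
    dphi = (a["phi"] - b["phi"] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(a["eta"] - b["eta"], dphi)


def cluster_tracks(tracks, cone=0.3):
    """Pt-ordered clustering: the highest-Pt unused track seeds each cluster."""
    remaining = sorted(tracks, key=lambda t: t["pt"], reverse=True)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        members, rest = [seed], []
        for track in remaining:
            # Tracks inside the cone around the seed join its cluster.
            (members if delta_r(seed, track) < cone else rest).append(track)
        remaining = rest
        clusters.append(members)
    return clusters


def isolated_single_tracks(tracks, min_pt=5.0, cone=0.3):
    """Seeds of one-track clusters whose Pt is above the (assumed) minimum."""
    return [
        cluster[0]
        for cluster in cluster_tracks(tracks, cone)
        if len(cluster) == 1 and cluster[0]["pt"] >= min_pt
    ]


if __name__ == "__main__":
    toy_tracks = [
        {"pt": 20.0, "eta": 0.0, "phi": 0.0},
        {"pt": 3.0, "eta": 0.1, "phi": 0.1},    # inside the cone of the 20 GeV seed
        {"pt": 10.0, "eta": 1.5, "phi": 3.0},   # isolated and above threshold
        {"pt": 2.0, "eta": -2.0, "phi": -2.0},  # isolated but too soft
    ]
    print([t["pt"] for t in isolated_single_tracks(toy_tracks)])
```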

    L3Propagator..............................................Arnaud Duperrin
    The classes (L3PropagatorManager - the interface allowing constant or non-
    uniform B selection, and L3propagatorSimple - covering the constant B
    case) have been defined.  Propagation can begin from any space point or
    DCA, and with the input of momentum, charge, and R to be propagated to,
    returns an extrapolated vector position.  This code is in cvs and in use
    by the jet tool.  Timing studies are underway, and will be followed by
    L3PropagatorFull (the non-uniform B-field method) as well as inward
    propagation.
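
    For orientation, the constant-B case can be sketched as follows. This is
    not the L3PropagatorSimple code, whose interface is not shown in this
    report. The sketch assumes a uniform field along z, positions in meters,
    momenta in GeV/c, charge in units of e and B in Tesla, and uses the
    standard relation pT = 0.2998 * |q| * B * rho. It finds the first outward
    crossing of the cylinder of radius R by stepping along the helix and
    bisecting, a simple numerical choice made here for brevity. Inward
    propagation is not covered. All names are hypothetical.

```python
import math

C_GEV_PER_T_M = 0.299792458  # pT [GeV/c] = C * |q| * B [T] * rho [m]


def transverse_position(x0, y0, phi0, kappa, s):
    """(x, y) after transverse path length s; kappa is the signed curvature."""
    if abs(kappa) < 1e-12:  # neutral particle or no field: straight line
        return x0 + s * math.cos(phi0), y0 + s * math.sin(phi0)
    phi = phi0 - kappa * s  # positive charge in +Bz turns clockwise
    x = x0 + (math.sin(phi0) - math.sin(phi)) / kappa
    y = y0 + (math.cos(phi) - math.cos(phi0)) / kappa
    return x, y


def propagate_to_radius(pos, mom, charge, b_tesla, r_target, n_steps=2000):
    """Extrapolated (x, y, z) at radius r_target, or None if never reached."""
    x0, y0, z0 = pos
    px, py, pz = mom
    pt = math.hypot(px, py)
    r0 = math.hypot(x0, y0)
    if pt == 0.0 or r0 > r_target:
        return None  # no transverse motion, or start is already outside R

    phi0 = math.atan2(py, px)
    kappa = C_GEV_PER_T_M * charge * b_tesla / pt

    def radius(s):
        return math.hypot(*transverse_position(x0, y0, phi0, kappa, s))

    # Search one full turn for a helix, or a generous length for a line.
    if abs(kappa) >= 1e-12:
        s_max = 2.0 * math.pi / abs(kappa)
    else:
        s_max = 2.0 * (r_target + r0)
    ds = s_max / n_steps

    lo = 0.0
    for i in range(1, n_steps + 1):
        hi = i * ds
        if radius(hi) >= r_target:
            break
        lo = hi
    else:
        return None  # track curls up before reaching r_target

    for _ in range(60):  # bisection on the bracketed crossing
        mid = 0.5 * (lo + hi)
        if radius(mid) >= r_target:
            hi = mid
        else:
            lo = mid

    s = 0.5 * (lo + hi)
    x, y = transverse_position(x0, y0, phi0, kappa, s)
    return x, y, z0 + s * pz / pt


if __name__ == "__main__":
    # Illustrative values only: 1 GeV/c transverse momentum, 2 T field, R = 0.5 m.
    print(propagate_to_radius((0.0, 0.0, 0.0), (1.0, 0.0, 0.5), +1, 2.0, 0.5))
```

    A production version would more likely solve the circle-cylinder
    intersection in closed form for speed; the numerical search is used
    here only to keep the sketch short and easy to check.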
     


    FNAL Farms

             Mike Diesburg


    Graphics

             Sharon Hagopian
    1. D0ve - Toby Burnett has implemented changes requested by shifters to have initial D0ve displays non-overlapping and sizes set by rcp parameters. Nobu Oshima is debugging Gtrack displays.
    2. D0scan - George Alverson is working on a web display. He will be at Fermilab for ~ 10 days in mid-August. The FSU group is working on utilities for 2-D D0scan displays.
    3. Special Requests - Eric Kajfasz requested a special 2-D SMT display. This has been implemented by Gavin Hesketh. It will be included in the next D0ve test release.


    Infrastructure

    Jonckheere on vacation

     

    Remote Production

    Iain is just ramping up


    Online:

    Stu Fuess

    - Continuing to fine tune the Access Control Lists this week.  Expect to
      have them all in place by end of week.

    - Jerry Guglielmo working on getting Kerberos to work for Tru64 V5.  This
      is a major stumbling block to Kerberizing the Online system.  Our goal is
      to allow only Kerberized ssh external access.

    - In dire need of both EXAMINE czar to coordinate efforts, and EXAMINE
      expert to work on READ_EVENT_DAQ low-level routines.  In general the
      EXAMINEs need more attention, particularly Global and Display instances.

    - Another round of high rate tests are pending, to see if recent ITC
      improvements have gained us anything.

    - Another 8 Linux nodes due to arrive at end of month, for DAQ (separate
      off the Distributor), Luminosity, Linux/Kai build node, and EXAMINE
      use.


    SAM Data Handling System

    Lee Lueking and Victoria White
    Summary:

    The system is in heavy use now, and people are both finding and creating
    problems - which is good. Inaccessibility of tapes has been a major
    problem recently. In a new mode of operation, someone attempted to run on
    a Linux cluster and fetch files from SAM; the files could not be fetched
    because there was no disk cache on the Linux cluster, and this managed to
    bring central-analysis on d0mino to its knees a couple of times, until it
    was tracked down.

    New features added to the system in the last release include the ability
    to fully run the Farms as a proper SAM station (which it has not done for
    the past year thereby becoming out-of-date with versions of software).
    This has been tested, but still not put into production on the Farms, so
    there may still be some issues we don't know about. Also this version of
    SAM supports a cluster of workstations of any sort, including the desktop
    cluster of ClueD0, where the disk cache is distributed throughout, but
    only one (or a few) servers designated to retrieve data from Enstore.
    This is now being put to the test on ClueD0.

    Datasets can now be defined using several attributes ("dimensions") from
    the Run Config database.

    The SAM team has a long list of little jobs to do for V3.2 release and is
    constantly busy with support and new people trying to use the system in
    new ways.  This is all good, but makes for slow progress on major new
    features.

    SAM stations are running at Nikhef, Prague and Lyon and produce MC data
    and enter it directly into the system. Also there are test stations
    running at Imperial College, MSU and Columbia.

    Issues:

    With every new station we must insist that there is a responsible station
    administrator who understands the possible tuning parameters of the
    system.  Some additional tuning parameters are urgently needed and are
    scheduled for V3.2 release in 2 weeks - these will control the total
    number of files that can be in transfer simultaneously on a station.
    Currently this parameter is buried and on a system like d0lxac1 where 10
    disks have been defined we see 10 transfers from Enstore starting up for
    each disk - and this kills the system.  Somehow these test installations
    immediately become 'production' facilities for users, without the person
    who installed them being the long-term responsible person.
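
    The effect of a station-wide cap on simultaneous transfers can be
    illustrated with a small sketch. It is not SAM code and does not use any
    SAM interface: the TransferLimiter class, the fake transfer function and
    the cap of 8 are hypothetical. It only shows that, with 10 disks each
    requesting 10 transfers, a single shared limit keeps the number in flight
    bounded instead of letting all 100 start at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TransferLimiter:
    """Caps the total number of transfers in flight on one station."""

    def __init__(self, max_transfers):
        self._slots = threading.BoundedSemaphore(max_transfers)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest number of simultaneous transfers observed

    def run(self, transfer, *args):
        with self._slots:  # blocks while max_transfers are already running
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_transfer(disk, index):
    """Stand-in for fetching one file; sleeps briefly instead of copying."""
    time.sleep(0.001)
    return disk, index


def run_station(n_disks, per_disk, max_transfers):
    """Submit n_disks * per_disk requests; return (peak in flight, completed)."""
    limiter = TransferLimiter(max_transfers)
    requests = [(d, i) for d in range(n_disks) for i in range(per_disk)]
    with ThreadPoolExecutor(max_workers=len(requests)) as pool:
        futures = [pool.submit(limiter.run, fake_transfer, d, i) for d, i in requests]
        results = [f.result() for f in futures]
    return limiter.peak, len(results)


if __name__ == "__main__":
    peak, done = run_station(n_disks=10, per_disk=10, max_transfers=8)
    print(f"completed {done} transfers, at most {peak} at a time")
```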

    We are working to enable more types of data, with more flexible and
    extensible meta-data to be stored in SAM. This should be done in the next
    few weeks.
    A documentation and error message blitz is needed.

    The d0 I/O system is difficult to deal with and to inform users of
    problems with a particular file.  Files are opened, and if there is an
    error it is reported, but there is no way to report that certain files
    should have been delivered, but were not.  This needs some help from Jim
    K. to think through what should be done. I think it involves other ways
    to funnel error messages through the framework.

    A task list for the next release V3.2 is posted off of the SAM
    development web pages at http://d0db-dev.fnal.gov/sam
     

    Future Features:

    Some of the SAM team members are now starting to work part of their time
    on the Particle Physics Data Grid, with hires to help with SAM to be made
    imminently.  This work will in fact be exactly along the line of work
    that has to be done for SAM in interfacing to batch systems, dispatching
    jobs and reliably running a whole chain of jobs.  This also relates to
    Farm production code. It is extremely likely that most remote Farms will
    not use the same job management system as at Fermilab, but rather be more
    inclined to use Condor, as this meshes more with work and funding they
    also may get from European grid initiatives.

    There are major pieces of work to be done in order to fully incorporate
    Trigger and Stream information and to fully support user Analysis and
    jobs.

    The MC request tracking and MC metadata must be finished and fully
    incorporated in the system.

    Of course a constant stream of small things will also need to be done at
    the same time.

    Understanding the entire distributed system, its status and examining
    failures is a difficult task and log files are hard to read.  We are
    working on a better monitoring and system information framework that will
    put all of this information into a monitoring database and allow us to
    better analyze it.  However, this does not add user functionality, so has
    to be prioritized accordingly.
     
     

    Databases

    Lee Lueking and Victoria White

    Summary:

    The Run Configuration database is now in production.

    The tests of using real calibration data for SMT calibration can now go
    forward, although the performance may still be less than desired and the
    size of the reco executable when bloated with calibration objects is
    still too large.  Many performance problems with the database servers and
    with d0om client code have been addressed, although more sophisticated
    handling of these immense in-memory caches of almost 2 million objects
    needs to be looked into. Other calibration databases remain in the same
    state as they have for many months - almost done, but need a real push to
    get them into production and working.

    The Trigger database is on a path for first release at the end of August.
    XML files are just now starting to be generated from the database.  Some
    more human interfaces need to be finished to enter parts of the trigger
    list and additional database triggers written to ensure integrity of all
    parts of the database.

    The Luminosity database design is stalled waiting for it to be filled
    directly from the luminosity monitor, instead of from loading files as it
    has been in test.  The file loading leads to small problems that have to
    be ironed out, and can only be when streaming live data into the
    database.  It is not clear if the design is adequate and it remains to be
    decided just how much of the raw online data must be archived to the
    offline database and how much can be summarized or discarded - assuming
    we keep a copy in the form of raw files archived to tape via SAM.  This
    latter process has also not been ironed out yet.

    The RCP database has been done for some time.  Hooking it up to the RCP
    package to save/restore sets of RCP parameters is still ongoing.

    The SAM database has been quite stable. There are now tables for tracking
    Farm jobs, but that part is not fully in production. Work continues on
    tables for recording and tracking MC request and physics information.
    Tables for tracking physics information of MC data have been complete for
    some time, but they are not being filled, or the data even produced yet
    by mc_runjob.

    Issues:

    Accessing calibration data still needs a lot more work.  We need to
    improve caching of data in the calibration database servers, possibly
    using calibration files written out by the servers as intermediate
    cache.  We also should consider putting these files in SAM and treating
    them as another type of file, for archive and cache management. I think
    we do not understand our access patterns for calibration data well enough
    - no-one has really been analyzing or using calibration constants up to
    now.  It is also not clear that the decision to use d0om objects for
    calibration objects and d0om persistency mechanisms was a good one, or is
    a viable long-term path.  We need to do some tests with non-d0om sets of
    calibration objects and study memory size and performance.  The database
    server and d0om code also must have cache flushing notification added to
    them.  If we need to go to non-d0om calibration objects then the
    calibration_manager code for each subsystem will have to be modified to
    support an alternative form of reading in the constants.

    The Run Config database is now in place and presumably bits of
    information from it will be needed in reco or other analysis code.  So
    the package that fetches these constants will need to be maintained and
    updated, possibly frequently.
     
     

    Simulation

    MC Generators Status
                    (S. Protopopescu)

    MCHerwig is now running properly and can be made available
    on the farms in p10.

    The programs MCSingle_x and MCCosmic_x have been moved
    from mc_runjob to d0_mcpp_gen. MCSingle_x has been
    modified slightly to handle single tau generation.
     

                    D0GSTAR Status for p10
                      (S. Kunori)

    All planned changes for p10 have been released with t01.55.00 build.

    Major changes from p09 to p10 are as follows.

    1) New 3D magnetic field map is now default. Default field orientation
    is solenoidal field: Bz positive at (0,0,0) and toroidal field: + in phi
    (i.e. counter clockwise). The previous field map is still available
    as an option .OLD'.
    2) Default ECUTs for calorimeter plate level geometry were lowered.
    3) A new option to swim very forward p/pbar to FPD in double precision
    was added. This option is OFF by default.
    4) SimEventInfoChunk has additional methods which return 1) version of
    magnetic field map and field orientation and 2) FPD tracking option.
     
     
     

                     PMCS Status
                     (M. Verzocchi)

      PMCS is not yet ready for use in a physics analysis.
    The general framework of PMCS is in place and the current
    work (by several contributors) focuses on providing the
    smearing functions for the various physics objects. We
    are working on the creation of physics objects chunks
    inside PMCS (so that for example the RecoAnalyze packages
    can be run on the PMCS output) and on the creation of the
    Thumbnail. A first version of PMCS for general use
    is expected for p11.
     

    Just added an initial version of the smearing functions
    for muons, and will add during the next two weeks code
    for simulating the response of the tracking system. The
    current code for smearing electrons and photons is being
    tested. Lagging somewhat behind on jets and missing energy, and have no
    coverage for taus. The number of people currently working
    on PMCS is unfortunately decreasing, the group would benefit
    from additional help from the physics groups (particularly
    from Jet/Met and Tau, and eventually Btagging and trigger
    simulation).

      For more details, see Sarah Eno's talk at the July 25th MC meeting
    two weeks ago (transparencies are available at
    http://www-d0.fnal.gov/~sceno/pmcs_doc/temp/simulatio_mtg_24jul01.pdf),
    which gives the status of each PMCS package. There is also
    a draft of a "developer's guide" for PMCS which contains a
    detailed list of the work to be done, available at
    http://www-d0.fnal.gov/~mverzocc/notes/pmcs/draft.ps
    (some things will be changed, and some things which are
    listed as "TO BE DONE" are already "DONE").
     
     

                D0SIM Status
              (S. Protopopescu)

    p09.02 corrected bugs in calorimeter packing/unpacking (negative energies)
    p09.07 applied proper weights to calorimeter trigger towers
    p09.08 corrected bug in cps packing/unpacking

    The pileup package has been updated in t01.55.00 (for p10) to:
    1) Treat mixture and plate d0gstar output on equal footing
    2) Apply correct weights to calorimeter trigger towers
    3) Ready to merge real minbias data with MC hard scatter
    4) Ensure cal. unpacking can use same weights for real and MC data

    All digi packages have code added to handle pseudo Sim data
    (i.e. SimXXXChunk's that include hits derived from real minbias data).

              D0RAW2SIM Status
              (S. Protopopescu)

    The first production version of D0Raw2Sim program will be available
    in p10 but it is not yet ready for use in farms. The package to
    handle L1 calorimeter towers is missing.