Computing and Software Status

September 2001

  • Algorithms

  • RECO
    L3
  • FNAL FARMs
  • Graphics
  • Infrastructure
  • Online
  • Remote Production
  • SAM
  • Databases
  • Simulation


  • Algorithms

    RECO

         Harry Melanson
    The status of RECO in the p10 production release is available at

    http://www-d0.fnal.gov/computing/algorithms/status/p10.html

    I have not yet asked the individual groups to send in their own reports.
     
     

    L3

    L3, as usual, slipped through the cracks: Dan asked for the status report late.
     
     
     
     
     
     
     


    FNAL Farms

             Mike Diesburg
    Farm Status, September 2001

    Running Problems:

            Farm systems have been experiencing significant degradation
    of throughput for a couple of weeks due to various problems
    associated with storing the output files into SAM.
            The root cause of the problems seems to be the unreliability of
    the Mammoth hardware.   This has led to backups in output
    queues, filled output buffers, and lost store requests.

            To help alleviate this we have widened the farm output
    file family to 4 and installed more robust versions of
    encp.
            We are still experiencing occasional storage failures
    with a "Broken pipe" error.    This is a network error of
    unknown origin.   The error rate for this is at the few percent
    level.
     

    Production Version Status:
            Farms are still running t01.56.00 as the standard
    production executable.   The version p10.04.00 recocert sample has
    been run through the farms.
            Many of the failed reco_analyze jobs have been tracked
    down to what appears to be a crash in the exit code of
    reco_analyze.    Harry Melanson is actively exploring ways to
    debug this and the possibilities of a temporary patch that
    would allow the root files to be used.
     

    Shift Training Status:

            An initial training session was held during the
    collaboration meeting where basic farm operations were
    covered as well as SAM shift operations.
            We expect to start one or two farm shifters
    within the next two weeks doing basic job submission chores
    and forwarding problems to resident experts.
     

    New Farm Nodes:

            Current delivery date for the 32 new farm nodes is
    Oct 2nd.    One evaluation sample of the nodes has been delivered
    to the farm group prior to shipment for verification.
     
     


    Graphics

  • D0Scan/OpenInventor:

  • The latest Fermilab KAI 4.0 release of TGS OpenInventor appears to work
    and packages relying on it should produce usable executables as of t01.60.00.
    George Alverson's current to-do list:
    1) Mag field from DB (may work, needs check)
    2) add'l Chunk IDs for GTracks (have IDs and sample code from Sherry Towers
      and Valentin Kuznetsov, but I'm not about to add an rcp for this
      although everyone seems to feel it requires it)
    3) move SAM support back into D0Scan/d0scan_qt from the _data versions
    4) generate d0tool for graphics package: no progress yet
    5) update docs
    6) Online...

    The FSU group continues to work on prototype 2-D classes for D0Scan.

    Online D0ve crashes on SMT unpacking. Experts are being consulted.
    Recent changes to muon segments cannot be retrofitted to online T1.51
    and await installation of version T1.58 or higher.
     
     


    Infrastructure

    During the last month we have continued to make regular "test" releases, one per week with both debug and maxopt versions of each on Linux and IRIX.

                    frozen
      t01.55.00     Aug 10
      t01.56.00     Aug 18
      t01.57.00     Aug 27
      t01.58.00     Sep  4
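
    As a quick check of the "one per week" cadence, the gaps between the freeze
    dates in the table above can be computed directly. This is only a minimal
    sketch: the dictionary and the function name `freeze_intervals` are
    illustrative, and the year 2001 is taken from the date of this report.

```python
from datetime import date

# Freeze dates copied from the test-release table (year assumed to be 2001).
freeze_dates = {
    "t01.55.00": date(2001, 8, 10),
    "t01.56.00": date(2001, 8, 18),
    "t01.57.00": date(2001, 8, 27),
    "t01.58.00": date(2001, 9, 4),
}


def freeze_intervals(dates):
    """Return the number of days between consecutive freeze dates."""
    ordered = sorted(dates.values())
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


intervals = freeze_intervals(freeze_dates)
print(intervals)
```

    The gaps come out at eight to nine days, so "weekly" is approximate.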

    Production Pass Releases:
                    frozen
      p09.07.00     Aug 17
      p08.13.00     Aug 21
      p09.08.00     Aug 24
      p10.00.00     Aug 24  (identical to t01.56.00)
      p09.09.00     Aug 27
      p09.10.00     Sep  4
      p10.01.00     Sep  6

    NT: This may be the final NT build until we do the wrapup one to archive the
    absolute final version. They are currently doing "local" quick builds trying to
    get something that works, recognizing that this data isn't and never will be
    publishable for Physics.
                    frozen
      pnt09.08.00   Aug 23

    OSF:
      onl01.58.00   is being done now.
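
    The release names listed above all share one visible pattern: a short letter
    prefix followed by three dot-separated numbers. A hedged sketch of a parser
    for that pattern is shown here. The class `ReleaseTag`, its field names, and
    the function `parse_release` are hypothetical and are not part of any D0
    tool; the meaning of the three numbers is not stated in this report, so the
    fields are deliberately named neutrally.

```python
import re
from typing import NamedTuple


class ReleaseTag(NamedTuple):
    prefix: str  # letters seen in this report: "t", "p", "pnt", "onl"
    first: int   # meaning of the numeric fields is an assumption-free label
    second: int
    third: int


# Letters, then three dot-separated integers, e.g. "p09.10.00".
_TAG_RE = re.compile(r"^([a-z]+)(\d+)\.(\d+)\.(\d+)$")


def parse_release(tag):
    """Split a release name such as 'pnt09.08.00' into its parts."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized release tag: {tag!r}")
    prefix, first, second, third = match.groups()
    return ReleaseTag(prefix, int(first), int(second), int(third))


# Production pass releases from the table, in the order they were frozen.
tags = ["p09.07.00", "p08.13.00", "p09.08.00", "p10.00.00",
        "p09.09.00", "p09.10.00", "p10.01.00"]

# Sorting on the parsed tuple orders the releases by their numbers.
ordered = sorted(tags, key=parse_release)
print(ordered)
```

    Sorting by the parsed numbers makes it easy to see that the p08, p09 and p10
    series are being maintained in parallel rather than frozen in numeric order.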

    Linux, RH7.1
    We have been attempting to begin RH7.1 builds as well. We have two machines,
    d0lxbld9 and 10 running RH7.1 mounting the /d0dist/ disk r/w from d02ka. This
    is the disk that is exported to the clued0 cluster. Apparent nfs problems have
    prevented this from working. The build machine has "hung", perhaps taking down
    d02ka every time we've tried it. We are working with the sys-admins trying to
    solve these problems, but so far without much success. We have gotten one build
    done, building on a clued0 machine. But even it hung on the first try. NOTE:
    there still is no official FermiRedHat Linux 7.1 version available. So a lot of
    this is "cutting edge".

    Build Resources:
      Memory
    We have hit a "wall" on the Linux build machines. Executables have gotten so
    large that if more than a couple were being linked at once on d0lxbld4 (1GB of
    memory), we would begin swapping, which slowed the builds so much that they
    took a *very* long time and sometimes failed entirely.

    The memory on d0lxbld4 has since been doubled, but we still can't do the 4-6
    builds simultaneously that the release schedule requires. We have been unable
    to use any of the other d0lxbld machines either. They have less memory and
    fewer processors, but the big problem is that they need to nfs mount the
    d0lxbld4 /d0dist/ disks. nfs on Linux is "buggy" enough that we *always* get
    several IO errors during a build. Depending on exactly when those errors occur,
    the entire build or only a portion of it is junk. In the worst case, which is
    close to being the usual one, enough of the build is junk that the entire thing
    has to be restarted from the beginning.

      Disks
    In the last month we have doubled the disk space available to our builds on
    Linux (to 200GB). We now need to add a significant increment on IRIX which is
    already at 170GB. The builds on IRIX are 20-30% bigger than on Linux. In
    addition it is the "master" distribution node. So we *must* have all the builds
    there, including OSF, NT etc. These aren't large, but they add up.
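
    A rough estimate of the IRIX increment follows from the numbers above. The
    sketch below ASSUMES that IRIX has to hold the same set of builds that now
    fills the 200GB on Linux; that assumption is ours and is not stated in the
    report. The extra OSF and NT builds are not counted, so the real need would
    be somewhat larger. All names in the code are illustrative.

```python
LINUX_BUILD_SPACE_GB = 200       # from the report
IRIX_CURRENT_GB = 170            # from the report
IRIX_EXTRA_PERCENT = (20, 30)    # IRIX builds are 20-30% bigger than Linux


def irix_space_needed(linux_gb, extra_percent):
    """Scale the Linux build footprint by the IRIX size overhead."""
    return linux_gb * (100 + extra_percent) / 100


# ASSUMPTION: the same builds are kept on both platforms.
low, high = (irix_space_needed(LINUX_BUILD_SPACE_GB, p) for p in IRIX_EXTRA_PERCENT)
shortfall = (low - IRIX_CURRENT_GB, high - IRIX_CURRENT_GB)

print(f"IRIX space needed: {low:.0f}-{high:.0f} GB")
print(f"Shortfall vs. current 170 GB: {shortfall[0]:.0f}-{shortfall[1]:.0f} GB")
```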
     
     

     

    Remote Production

    Current Farm Software:
      mcp07 is running on all farms and a major effort is underway to
      generate samples for the Run IIb trigger studies. So far 948750
      events have been generated and their reco files submitted into
      SAM (11 September 2001,
      sam translate constraints --rpn='file_name reco_%mcp07%').

     There are some problems with recoanalyze which will be addressed
      with a new release.
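
      The constraint in the sam command quoted above, 'file_name reco_%mcp07%',
      looks like an SQL-style pattern. The sketch below ASSUMES that "%" matches
      any run of characters, as in SQL LIKE, and treats every other character
      literally; the report does not confirm how SAM interprets the pattern. The
      file names are invented for illustration and `like_to_regex` is a
      hypothetical helper, not a SAM function.

```python
import re


def like_to_regex(pattern):
    """Translate a pattern where '%' means 'any characters' into a regex.

    ASSUMPTION: only '%' is a wildcard; everything else is literal.
    """
    parts = (".*" if ch == "%" else re.escape(ch) for ch in pattern)
    return re.compile("^" + "".join(parts) + "$")


matcher = like_to_regex("reco_%mcp07%")

# Invented file names, for illustration only.
hypothetical_files = [
    "reco_zee_mcp07_001",
    "sim_zee_mcp07_001",
    "reco_ttbar_mcp06_002",
]

selected = [name for name in hypothetical_files if matcher.match(name)]
print(selected)
```

      Under that assumption only names that start with "reco_" and contain
      "mcp07" are selected.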

    Software Releases:

        p09.10.00 - d0gstar is broken leading to segmentation faults on
           Linux machines. Until this is fixed we cannot run p09 on the
           farms. mc_runjob has been configured to work with p09.10 and is
           ready to go. All other pieces of p09 appear to work.  The 500 k
           trigger request cannot start until this is complete.

       Runtime Environment - Jonathan Hays has a proposal to rework the
           DZero runtime environment to make it more portable. David
           Ritchie (Fermilab) is going to help upgrade packages to support
           this new idea.  This is being done in conjunction with
           mc_runjob and d0tools.
    Software development

        GRID - On Wednesday 12 September we will have a working meeting to
           examine interfacing grid tools with SAM.

        mc_runjob - Need to integrate sam executables into mc_runjob.
     
     


    Online:

    *** Online monthly report, 19-Sep-01 ***

    Hardware activities:
      - Racks and nodes for L3 Linux farm are here.  There are some issues
        getting appropriate power to the racks, and there will likely be a
        kluge solution until electrical work can proceed during the October
        shutdown.  Some parts (network patch panels, power distribution) are
        missing from Linux Networx, but we'll likely go ahead with installation.
      - Some slight network upgrades in the Gb network handling VRC to L3 farm
        traffic.  There is now a Gb uplink to the 6509.
      - 8 new Linux nodes for Examines, etc. are more than 3 weeks late.

    Software activities:
      - Lots of work on security and ACLs, slowly and painfully converging;
        will have shutdown activities on Kerberization and "disaster recovery";
        making sure all applications properly configured to work within ACL
        restrictions.
      - Steady improvements to controls applications and diagnosis of sticky
        problems with 1553 (which may have hardware origins);  pushing ahead
        on alarm system

    Software concerns:
      - Have no one to guide and direct Examine efforts, nor to understand the
        low level issues.  Need to use trigger selection capabilities.
      - Graphics appears to have little direction and momentum.


    SAM Data Handling System

    Status:
    We are trying to finish the current work toward a v3.2 sam release soon.
    This will include new disk cache accessibility and many other features and
    fixes. It is being tested and might be ready next week, although there have
    been some problems getting everything set up correctly.

    Increasing the SAM cache to 12 TB was discussed in the ORB. We do not see
    this as a problem; the cache is currently at almost 5 TB.

    We had our first organizing meeting for new sam shifters on Tuesday. There
    are 11 in European time zones and fewer in US time zones, so the coverage
    will be somewhat skewed to the early morning hours. Nevertheless, this will
    help our development team.

    There is a lot of interest in Grid issues among the remote sites, especially
    those with sam stations deployed. We have a serial root interface modeled
    after the SAMManager used in the d0 framework.

    Completion of the Monte Carlo request package has been delayed, and we need
    a crash program to complete it because David Evans seems to be losing
    enthusiasm. More help in this area is needed.

    The dataset editor dimensions have been expanded to include parameters in
    the run config tables, now available offline.

    Problems:
    1. Stu has had to open up ACLs for the 131.225.222 subnet to allow packets
    from d0ora1 that are coming through the wrong interfaces to the online
    system. We are trying to find and fix this.
    2. We have recently had problems with project masters not calling back
    consumer clients on central analysis. Igor looked at this yesterday; I
    am not sure what he found.
    3. Although we planned and coordinated the python v2.1 upgrade, there
    were some problems, which have been worked out. There still seem to be
    some problems with 2.1a.
    4. The usual noaccess volume issues.
    5. Some contention for tape drives among online, farms, and
    other offline users has been observed.

    Plans
    1. We hope to have the 3.2 release ready by the end of Sept.
    2. Some work will be required to move to LTOs for storing MC data, and
    9940s for detector data. We hope to transition the MC asap, determined
    by when the tapes and robot bins arrive and are installed in the robot.
    The 9940s will certainly be ready for data taking after the shutdown.
    3. Our next major release will be scheduled for sometime in mid
    November.  Items being discussed for this release include
    a) transitioning to OmniORB to replace fnorb, the python module,
    b) new features to trace the life cycle of data set definitions and
    datasets for each group,
    c) a new command parser that has better help included.
    4. Work on displays for SC2001 should provide some neat visual views of
    the operation of the system that will be fun to watch, and probably
    useful for detecting problems.
     
     
     
     

    Databases

    Lee to summarize "taking stock meeting"

    Simulation

    Simulation Status (Sept., 2001)
    ===============================

    D0gstar/D0sim:
         Mainly working on p10 production release.
         Changes in D0gstar for p09 and p10:
             - All detectors, including LUMI and FPD.
             - New 3D magnetic field map
             - New interface to access the field configuration.
             - Default ECUTs for calorimeter plate level geometry were
              lowered to get more accurate response.
             - A new option to swim very forward p/pbar to FPD in double
              precision was added. This option is OFF as default.
             - New access methods on the option in event information.
             - Changed from GEANT 3.21.12 to  GEANT 3.21.13 (Geant bug fix).
          Changes in D0sim:
             in p09.10.00:
             - Turned on smt noise simulation,
             - use cal only optimized weights in L1CalTTChunk.
             - Handle both mixture and plate geometry.
             in P10:
             - CalDataChunk uses cal only optimized weights
             - added code to handle pseudo SimChunks.
         To test the p10 release, a set of MC samples was generated through
         D0gstar and D0sim. The MC verification samples include
          Z->ee, mumu, tautau, ttbar and single electron. We have generated
         20 files with mixture or plate geometry, and with 2.5mb or 0mb
         separately.
         CPU and output size studies have been done on the p10 verification
         samples.
         All these files have been tested with mc_examine and with each
         subdetector's examines. The check on RawDataChunk looks ok.
         We certified that p10 d0gstar/d0sim can be put on the MC farm.
         A lot of effort has been put into the comparison of the mixture
         and plate Monte Carlo.
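
         The count of 20 verification files is consistent with one file per
         combination of physics sample, geometry and "mb" setting. The sketch
         below enumerates those combinations. It ASSUMES that the sample list
         reads as five samples (Z->ee, Z->mumu, Z->tautau, ttbar, single
         electron) and that every combination was produced exactly once; the
         report gives only the total. The variable names are illustrative.

```python
from itertools import product

# ASSUMPTION: "Z->ee, mumu, tautau" is read as three separate Z samples.
physics_samples = ["Z->ee", "Z->mumu", "Z->tautau", "ttbar", "single electron"]
geometries = ["mixture", "plate"]
mb_settings = ["2.5mb", "0mb"]   # labels kept exactly as written in the report

# One configuration per (sample, geometry, mb) combination.
configurations = list(product(physics_samples, geometries, mb_settings))

for sample, geometry, mb in configurations:
    print(f"{sample:16s} {geometry:8s} {mb}")

print(len(configurations), "configurations")
```

         Five samples times two geometries times two settings gives the 20 files
         quoted above.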

    PMCS:
        The general framework of PMCS is in place and the current work
        focuses on providing the  smearing functions for the various
        physics objects.
        A major effort is now under way to put pmcs output into the d0phyobj
        chunk, and to simulate the variables in the chunk properly.

    D0TrigSim (by Oneil)
         P10 contains a lot of big changes for d0trigsim infrastructure.
         For example L1/L2 changes to run on real data:
             - l1l2unpacker learns VRB format.
             - more analyze packages read directly from RDC (l1cal, l1muon,
              l2caljet, l2calmet, l2calem, l2cps, l2gbl).
             - tsim_l1ft completely rewrites L3 output classes to better
              emulate real data. New unpacker to be written for l1ft analyze
              package and examine.
         More functionality added to l2gbl (track match), L3 filter/tools,
         L2errorlogger integration, etc.