Shift Captain or DAQ Expert instructions to check health of the data handling system (SAM and ENSTORE)

Last Updated 1 July 2001  -  maintained by  Wyatt Merritt

1) Click on  the ENSTORE Status page   (From D0 At Work => SAM (Data Access) => Diagnostics => Enstore Status)

      From this page, you can check four important things:

        Is the robot operational?

             Click on Enstore Status At A Glance.  The ball next to ADIC AML/2 should be green.  If it is not,
             and shift captains have not been notified of a scheduled robot downtime or an
             already known problem, then go to Step 2)  Calling in help.

       Are there still working Mammoth 2 drives?
                Click on  Enstore Status At A Glance  and look in the Movers section under
               Enstore Individual Server Status.

               The movers which control Mammoth 2 tape drives are named DIxxM2.   As long as
                at least one such mover is alive (has a green ball next to its name), we will be able to
                write data at a sufficient rate.  (Once we go to full rate operations, this will require at
                least 2 such movers.)    If the robot is alive, but there are no green M2 movers, go to
                Step 2)  Calling in help.
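
The mover check above can be sketched in a few lines.  This is only an illustration, not a tool that
exists on the status page: it assumes the mover names and ball colours have already been copied by hand
into a dictionary, and it reads "DIxxM2" as "DI", any two characters, then "M2".  The mover names in the
example are made up.

```python
import re

# Assumption: "DIxxM2" means "DI", two arbitrary characters, then "M2".
M2_MOVER = re.compile(r"^DI..M2$")


def live_m2_movers(mover_status):
    """Return the sorted names of Mammoth 2 movers whose status ball is green."""
    return sorted(
        name
        for name, colour in mover_status.items()
        if M2_MOVER.match(name) and colour == "green"
    )


def can_write_data(mover_status, required=1):
    """True if enough M2 movers are alive (1 for now, 2 at full rate)."""
    return len(live_m2_movers(mover_status)) >= required


# Hypothetical snapshot copied from the Movers section of the status page.
status = {"DI01M2": "green", "DI02M2": "red", "DC03T": "green"}
print(live_m2_movers(status))
print(can_write_data(status))
print(can_write_data(status, required=2))
```

Passing required=2 corresponds to the full rate case mentioned above.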

        Are data transfers to tape proceeding without errors?

                 (Click Back from Status at a Glance).  Click   Encp history.   This shows a history of the
                  system's transfers to and from tape, most recent first.    The ones from the online system
                  show up with 'd0olc-100mb-1'  (or something similar) in the Node column.   If the transfers
                  are proceeding normally, there'll be a clickable link with a big number, which will show you
                  which file names were being written in each transfer, to which tape, etc.  If there is an error,
                  the clickable link will say 'Error'.    If there is a question about data getting to tape properly,
                  you can look here (probably at the behest of an expert) to check if online transfers show
                  errors.   Note that you may see a large number of errors from the offline systems.  These most
                  likely come from the intensive test program that is shaking down different kinds of
                  drives as possible replacements for the Mammoth 2's, so don't worry about them.
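
As a sketch of the rule above (only errors on online transfers matter), assume the Encp history rows
have been copied into a list of dictionaries.  The field names "node" and "link", the prefix "d0olc",
the offline node name and the byte count are all assumptions made for this example; the real page
layout may differ.

```python
def online_errors(history, node_prefix="d0olc"):
    """Return the rows from online nodes whose clickable link reads 'Error'."""
    return [
        row
        for row in history
        if row["node"].startswith(node_prefix) and row["link"] == "Error"
    ]


# Hypothetical rows, most recent first, as on the Encp history page.
history = [
    {"node": "d0olc-100mb-1", "link": "104857600"},  # normal transfer
    {"node": "d0olc-100mb-1", "link": "Error"},      # online error: worth reporting
    {"node": "offline-test-1", "link": "Error"},     # offline test error: ignore
]
print(online_errors(history))
```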

         Are tapes becoming inaccessible due to drive or robot problems?

                  (Click back from Encp history).  Click on  Tape inventory.    From the list at the top click
                   on NO ACCESS.  (This page does not refresh itself.  If you have been to the page before and
                   cached it, you'll need to hit RELOAD; a tag at the top of the page tells you the time it was
                   generated.)  Scroll to the bottom of the list and look at the PRI tapes which are marked
                   NO ACCESS.   If more than 1 or 2 of these tapes come from the family D0.datalogger.cpio_odc,
                   that MAY indicate a robot or drive problem.    In that case, go to Step 2) Calling in help.
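
The NO ACCESS rule above amounts to a count against a small threshold.  In the sketch below the
threshold of 2 is one reading of "more than 1 or 2", "PRI" is assumed to be a prefix of the tape label,
and the tape labels and the second family name are invented for the example.

```python
DATALOGGER_FAMILY = "D0.datalogger.cpio_odc"


def suspicious_no_access(tapes, family=DATALOGGER_FAMILY, threshold=2):
    """Return (flag, labels) for PRI tapes of `family` on the NO ACCESS list.

    `tapes` holds only tapes already marked NO ACCESS.  The flag is True
    when more than `threshold` of them belong to the datalogger family.
    """
    labels = [
        t["label"]
        for t in tapes
        if t["label"].startswith("PRI") and t["family"] == family
    ]
    return len(labels) > threshold, labels


# Hypothetical entries copied from the bottom of the NO ACCESS list.
tapes = [
    {"label": "PRI101", "family": "D0.datalogger.cpio_odc"},
    {"label": "PRI102", "family": "D0.datalogger.cpio_odc"},
    {"label": "PRI103", "family": "D0.datalogger.cpio_odc"},
    {"label": "PRI200", "family": "some.other.family"},
]
flag, labels = suspicious_no_access(tapes)
print(flag, labels)
```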

          AT THIS POINT,  if everything has looked normal, you can be sure the data are being written to
          tape, provided the data logger is working.   The next section covers whether the data are being made
          accessible from SAM.  This is necessary for the farm processing to proceed and for ordinary users to get
          at the data, but if anything in the next section doesn't work, it is recoverable later, so it is not at this
          point a matter for paging people.  Once we are at higher rates and need the farm processing to keep up in
          real time, this part will have to be handled 24/7 as well.  Meanwhile, we should keep track of problems
          and make sure they are handled during normal working hours (8 hours a day, 5 days a week).

          Are data files going into SAM and being processed on the farm?

           Go to the SAM Data Files Query page   (SAM home => Browse the SAM Meta-data => Data Files)
           Enter 'physics data taking' for Run Type;  'raw' for Data tier;  today's or yesterday's date for Created After;
            and click Run.    This will show you the most recent entries in the SAM event file catalog from the online
            system.  If the current or next-to-current run shows up, you can believe the system is talking to the database
            and therefore all the basics are working.  Go back to the query page and repeat the query with 'reconstructed'
            instead of 'raw' for the Data tier, and you can see which runs the farm has recently processed and stored.

2)  Calling in help.

           If the robot is down or there are no functioning Mammoth 2 movers: first of all, right now
           there is no need to panic.  Once we are at high rate, the online system can survive for about
           24 hours without losing data when the robot is down.   However, right now we can survive for
           many days.   So, if it is business hours, call the helpdesk (x2345) and explain the problem.
           Indicate that D0 data taking is affected and the problem is urgent.   Also, send email to the
           sam-users@fnal.gov  and the  enstore-admin@fnal.gov  mailing lists indicating that the robot
           is down or that all the Mammoth 2 drives are down.    If the commissioning coordinator authorizes
           more urgency (for example, if we are straining to reconstruct a particular data set over the weekend
           and the robot or tape drives all disappear), you may call 2345 and follow the instructions to reach
           the on-duty operator in the Feynman Computing Center, and request that the ENSTORE primary be
           paged.  Let's not use this unless we really need it.

            Exception to the above soothing words:  If many recently written tapes are marked NO ACCESS,
            this COULD indicate a serious robot/drive problem.  It is UNLIKELY that we'll see it here first,
            but if it is a weekend, and you see that many tapes from the datalogger family have gone NO ACCESS
            (and there's no notice in shift instructions to the contrary because of known problems), then call
            2345  and have the ENSTORE primary paged to look into this.

            If the SAM queries aren't working, send email to the  sam-users@fnal.gov  list and request the
            SAM shifter to look into it.  There is a SAM shifter 7 days a week, but not 24 hours a day.
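
The calling-in rules above can be summarized as a small decision function.  This is one reading of
the text, not an official procedure: the flag names are invented for the sketch, and outside business
hours without authorization it assumes that the email alone is sent.

```python
HELPDESK = "call the helpdesk at x2345 and explain that D0 data taking is affected"
EMAIL_BOTH = "email sam-users@fnal.gov and enstore-admin@fnal.gov"
PAGE = "call 2345 and ask the on-duty operator to page the ENSTORE primary"
EMAIL_SAM = "email sam-users@fnal.gov and ask the SAM shifter to look into it"


def actions(robot_or_all_m2_down=False, business_hours=False,
            coordinator_authorizes_page=False,
            many_datalogger_no_access=False, weekend=False,
            known_problem_notice=False, sam_queries_failing=False):
    """Return the list of steps to take, in order, for the observed symptoms."""
    steps = []
    if robot_or_all_m2_down:
        if business_hours:
            steps.append(HELPDESK)
        steps.append(EMAIL_BOTH)
        if coordinator_authorizes_page:
            steps.append(PAGE)
    # The exception: many datalogger tapes going NO ACCESS on a weekend.
    if many_datalogger_no_access and weekend and not known_problem_notice:
        if PAGE not in steps:
            steps.append(PAGE)
    if sam_queries_failing:
        steps.append(EMAIL_SAM)
    return steps


print(actions(robot_or_all_m2_down=True, business_hours=True))
print(actions(many_datalogger_no_access=True, weekend=True))
print(actions(sam_queries_failing=True))
```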