From this page, you can check four important things:
Is the robot operational?
Click on Enstore Status At A Glance. The ball next to ADIC AML/2 should be
green. If it is not, and there are no notifications to shift captains of
scheduled downtimes for the robot or of already known problems, then go to
Step 2) Calling in help.
Are there still working Mammoth 2 drives?
Click on Enstore Status At A Glance and look in the Movers section under
Enstore Individual Server Status. The movers which control Mammoth 2 tape
drives are named DIxxM2. As long as at least one such mover is alive (has a
green ball next to its name), we will be able to write data at a sufficient
rate. (Once we go to full rate operations, this will require at least 2 such
movers.) If the robot is alive, but there are no green M2 movers, go to
Step 2) Calling in help.
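The robot and mover checks above amount to a small decision rule. Here is a minimal sketch in Python, assuming the shifter has copied the ball colours from the status page by hand into a dictionary of mover name to colour. The function names, the dictionary layout, and the reading of "xx" in DIxxM2 as any two letters or digits are illustrative assumptions, not part of Enstore.

```python
import re

# Assumption: "xx" in DIxxM2 stands for any two letters or digits.
M2_MOVER = re.compile(r"^DI[A-Za-z0-9]{2}M2$")


def green_m2_movers(mover_status):
    """Return the sorted names of Mammoth 2 movers whose ball is green."""
    return sorted(
        name for name, colour in mover_status.items()
        if M2_MOVER.match(name) and colour == "green"
    )


def need_help(robot_green, mover_status, known_problem=False, min_movers=1):
    """True if the checks above say to go to Step 2) Calling in help.

    known_problem: a scheduled downtime or known problem has been announced.
    min_movers: 1 for now; the text says full rate operations will need 2.
    """
    if not robot_green:
        # A robot that is down only needs a call if nobody announced it.
        return not known_problem
    return len(green_m2_movers(mover_status)) < min_movers
```

Raising min_movers to 2 covers the full rate case without changing anything else.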
Are data transfers to tape proceeding without errors?
(Click Back from Status At A Glance.) Click Encp history. This shows a
history of the system's transfers to and from tape, most recent first.
Transfers from the online system show up with 'd0olc-100mb-1' (or something
similar) in the Node column. If the transfers are proceeding normally, each
entry has a clickable link with a big number, which shows you which file
names were written in that transfer, to which tape, etc. If there is an
error, the clickable link will say 'Error'. If there is a question about data
getting to tape properly, you can look here (probably at the behest of an
expert) to check whether online transfers show errors. Note that you may see
a great many errors from the offline systems. These likely come from the
intensive test program that is shaking down different kinds of drives as
possible replacements for the Mammoth 2's, so don't worry about them.
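The rule above (look only at online transfers, and treat a link reading 'Error' as a failure) can be sketched as a filter. This is an illustration only: it assumes the Encp history rows have been transcribed into dictionaries with hypothetical keys "node" and "link", and it assumes online nodes can be recognised by the prefix 'd0olc', taken from the example node name in the text.

```python
def online_errors(rows, online_prefix="d0olc"):
    """Return the rows from the online system whose link text is 'Error'.

    rows: list of dicts with the assumed keys "node" and "link".
    Errors from offline nodes are ignored, as the text advises.
    """
    return [
        row for row in rows
        if row["node"].startswith(online_prefix) and row["link"] == "Error"
    ]
```

An empty result means no online transfer in the transcribed history reported an error.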
Are tapes becoming inaccessible due to drive or robot problems?
(Click Back from Encp history.) Click on Tape inventory. From the list at the
top, click on NO ACCESS. (This page doesn't update on its own. If you have
been to the page before and cached it, you'll need to hit RELOAD; a tag at
the top of the page tells you the time it was generated.) Scroll to the
bottom of the list and look at the PRI tapes which are marked NO ACCESS. If
more than 1 or 2 tapes come from the family D0.datalogger.cpio_odc, that MAY
indicate a robot or drive problem. Go to Step 2) Calling in help.
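The NO ACCESS check above is a count against a threshold. Here is a minimal sketch, assuming the PRI entries from the NO ACCESS list have been copied into (volume label, file family) pairs. The pair layout and the function name are hypothetical, and threshold=2 is only one reading of "more than 1 or 2 tapes".

```python
DATALOGGER_FAMILY = "D0.datalogger.cpio_odc"


def datalogger_no_access(tapes, threshold=2):
    """Count NO ACCESS tapes in the datalogger family and flag a problem.

    tapes: iterable of (volume_label, file_family) pairs.
    Returns (count, flagged); flagged is True when count exceeds threshold.
    """
    count = sum(1 for _label, family in tapes if family == DATALOGGER_FAMILY)
    return count, count > threshold
```

A flagged result only means the situation MAY indicate a robot or drive problem; the decision to call for help stays with the shifter.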
AT THIS POINT, if everything has looked normal, you can be sure the data are
being written to tape if the data logger is working. The next section covers
whether the data are being made accessible from SAM. This is necessary for
the farm processing to proceed and for ordinary users to get at the data, but
if anything in the next section doesn't work, it is recoverable later, so it
is not at this point a matter for paging people. Once we are at higher rates
and need the farm processing to keep up in real time, this part will have to
be handled 24/7 as well. Meanwhile, we should keep track of problems and make
sure they are handled in 8/5 hours.
Are data files going into SAM and being processed on the farm?
Go to the SAM Data Files Query page (SAM home => Browse the SAM Meta-data =>
Data Files). Enter 'physics data taking' for Run Type; 'raw' for Data tier;
today's or yesterday's date for Created After; and click Run. This will show
you the most recent entries in the SAM event file catalog from the online
system. If the current or next-to-current run shows up, you can believe the
system is talking to the database and therefore all the basics are working.
Go back to the query page and repeat the query with 'reconstructed' instead
of 'raw' for the Data tier, and you can see which runs the farm has recently
processed and stored.
2) Calling in help.
If the robot is down or there are no functioning Mammoth 2 movers: first of
all, right now there is no need to panic. Once we are at high rate, the
online system can survive for about 24 hours without losing data when the
robot is down. Right now, however, we can survive for many days. So, if it is
business hours, call the helpdesk (x2345) and explain the problem. Indicate
that D0 data taking is affected and the problem is urgent. Also, send email
to the sam-users@fnal.gov and enstore-admin@fnal.gov mailing lists indicating
that the robot or the Mammoth 2 drives are all down. If the commissioning
coordinator authorizes more urgency (for example, if we are straining to
reconstruct a particular data set over the weekend and the robot or tape
drives all disappear), you may call 2345 and follow the instructions to reach
the on-duty operator in the Feynman Computing Center to request that the
ENSTORE primary be paged; but let's not use this unless we really need it.
Exception to the above soothing words: If many recently written tapes are
marked NO ACCESS, this COULD indicate a serious robot/drive problem. It is
UNLIKELY that we'll see it here first, but if it is a weekend and you see
that many tapes from the datalogger family have gone NO ACCESS (and there's
no notice in the shift instructions to the contrary because of known
problems), then call 2345 and have the ENSTORE primary paged to look into
this.
If the SAM queries aren't working, send email to the sam-users@fnal.gov list
and request the SAM shifter to look into it. There is a SAM shifter 7 days a
week, but not 24 hours a day.
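The escalation rules in Step 2 can be summarised as a lookup from what the shifter observed to whom to contact. The sketch below is one reading of the text, not an official procedure. The function and parameter names are hypothetical, and it assumes the email to the mailing lists is sent whether or not it is business hours, which the text does not state explicitly.

```python
PAGE_PRIMARY = "call 2345 and ask to have the ENSTORE primary paged"


def escalation_steps(robot_or_drives_down=False, many_no_access=False,
                     sam_queries_failing=False, business_hours=True,
                     weekend=False, coordinator_authorized=False,
                     known_problem=False):
    """Return, in order, the actions suggested by Step 2 for the inputs."""
    steps = []
    if robot_or_drives_down:
        if business_hours:
            steps.append("call helpdesk x2345: D0 data taking affected, urgent")
        # Assumption: the email goes out regardless of the time of day.
        steps.append("email sam-users@fnal.gov and enstore-admin@fnal.gov")
        if coordinator_authorized:
            steps.append(PAGE_PRIMARY)
    # The exception: many datalogger tapes NO ACCESS on a weekend.
    if many_no_access and weekend and not known_problem:
        if PAGE_PRIMARY not in steps:
            steps.append(PAGE_PRIMARY)
    if sam_queries_failing:
        steps.append("email sam-users@fnal.gov and ask for the SAM shifter")
    return steps
```

For example, escalation_steps(sam_queries_failing=True) returns only the email to the SAM shifter, matching the point that SAM problems are not a matter for paging.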