Simple VMS System Diagnostics
This document is meant to help an unskilled, non-privileged user discover
various facts about the VMS computer they are using: its operation, resources
and status. In particular, it is meant to let a user find out enough about
their machine to discover why they are having problems or, at least, to give
a system manager enough information to figure out why. In either case,
figuring out what's wrong, if anything, goes a long way toward fixing the
problem.
There are a number of tools provided by VMS to help diagnose system problems.
Many of these are available to a non-privileged user. I will first list the
most useful monitoring commands, then go into them in more detail, and finally
go through some examples of how and why they'd be used.
Commands
Examples
Commands
- SHOW SYSTEM shows the processes on your machine (system). It can be
  qualified by /INT (interactive), /BAT (batch) and /NET (network) to pick out
  these types of jobs, by /FULL to display the owner of each job, and by
  /NODE=node_name to look at another machine. The interesting display items
  are:
  - PID: process ID, used in later commands.
  - NAME: process name. For example, DECW$TE... are DECWindows terminal
    emulators, created by people logging in via D0$DECW:REMOTE or several
    other ways, but not via SET HOST or TELNET; BATCH_nnn are batch jobs; etc.
  - STATE: anything but CUR, LEF, HIB or COM reveals a potential problem. In
    particular, RWAST (Resource Wait, AST) is bad, and almost any RW state
    isn't good. PF* states (for example PFW, page fault wait) are jobs
    waiting for a page to be read in from disk, typically from the page file.
    See SHOW MEMORY.
  - PRI: current priority (note that it goes up and down with time due to the
    scheduling algorithm).
  - IO: number of I/Os from the process; you can find I/O hogs here.
  - CPU: CPU time used. The owners of interactive jobs using lots of CPU,
    especially at priority 4, may need some persuasion (with a small club?).
    Long jobs should not be run interactively.
  - PAGE FLTS: page faults, i.e. the number of times the system had to bring
    pages into the process's working memory. Most of these faults do not go to
    disk, so large numbers aren't necessarily a problem.
  - PH.MEM: physical memory, the current number of 512-byte (VAX) or
    8192-byte (AXP) physical memory pages being used by the process.
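As a rough sketch, the qualifiers above might be combined at the DCL prompt as follows. The qualifier names are spelled out in full here where the text abbreviates them, and node_name is a placeholder for a real node in your cluster.

```dcl
$! All processes on the machine you are logged in to
$ SHOW SYSTEM
$! Only interactive processes, with the owner of each one
$ SHOW SYSTEM/INTERACTIVE/FULL
$! Only batch jobs on another machine (node_name is a placeholder)
$ SHOW SYSTEM/BATCH/NODE=node_name
```

Note the PID of anything in an unusual state; it is what SHOW PROCESS/ID expects.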
- SHOW PROCESS shows details of a particular process. It is easiest to use by
  specifying the PID (/ID=pid) found via SHOW SYSTEM. The default is the
  current process. SHOW PROCESS/ALL shows all the information available, and
  there are many other qualifiers to get just the information you want. Of
  more interest is /CONTINUOUS, which gives a continuous display. NOTE: in
  most cases you can only look at your own processes this way.
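For illustration, the two most common forms might look like this. The PID shown is made up; substitute one taken from your own SHOW SYSTEM display.

```dcl
$! Everything known about your current process
$ SHOW PROCESS/ALL
$! Continuously updated display of one process (2040011A is a made-up PID)
$ SHOW PROCESS/ID=2040011A/CONTINUOUS
```

As far as I recall, typing E leaves the continuous display; treat that as an assumption and check HELP SHOW PROCESS on your system.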
- MONITOR PROCESS continuously monitors the processes on your machine or,
  with /NODE=node_name, on another machine. Without qualifiers it is a lot
  like SHOW SYSTEM, but continuous.
  - /TOPCPU gives a graphical display of the top CPU users on your machine;
    /NODE=node_name gets the same data from another machine.
  - /TOPFAU shows the processes with the most page faults. NOTE: large
    numbers are not always a problem.
  - /TOPBIO shows the top buffered I/O users.
  - /TOPDIO shows the top direct I/O users.
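A short sketch of typical MONITOR invocations follows. The class name is written in full (PROCESSES), /NODE is placed directly after the MONITOR verb, and node_name is a placeholder.

```dcl
$! Bar chart of the top CPU users on the local machine
$ MONITOR PROCESSES/TOPCPU
$! Processes with the highest page fault rates
$ MONITOR PROCESSES/TOPFAULT
$! Top direct I/O users on another machine (node_name is a placeholder)
$ MONITOR/NODE=node_name PROCESSES/TOPDIO
```

Ctrl-Z ends the display and returns you to the DCL prompt.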
- MONITOR SYSTEM continuously monitors many of the same things as MONITOR
  PROCESS, but on one page, showing only the totals and the single top user
  in each category. It also shows the total page fault rate, including an
  indication of the rate of paging to disk: within the total (top) page fault
  rate bar there is a vertical line. To the left of that line is the rate to
  disk; to the right is the rate to memory. Paging to memory is not normally
  a problem; it's fast and safe. Paging to disk is slow, and a high rate
  (more than a few per second) is to be avoided at almost any cost (as in $
  for more memory!). Several hundred faults per second in total is fairly
  normal (the bar has a maximum of 100, so it will usually be pegged). So if
  the vertical line is away from the left edge, there is a possible problem:
  reduce the number of big running jobs, or buy more memory. NOTE: session
  managers, terminal emulators, clocks, MAIL, etc., especially the DECWindows
  versions, take a significant amount of physical memory, and even more of
  the page file.
- SHOW DEVICE [device] without a device name shows all devices known to the
  cluster.
  - The device name can use a form of wildcard: give the first few characters
    of the device name or type and it will show all devices that match. In
    particular:
    - "SHOW DEVICE DK" shows most disks in the cluster. Disks are usually
      DK, DU or DS. On FNALD0 all of them are DK.
    - "SHOW DEVICE node_name$DK" shows disks on system "node_name".
    - "SHOW DEVICE MK" shows tape drives, but only those on the machine
      you're on; the cluster doesn't know about tape drives.
  - The device name can be any logical that translates to include a device,
    so "SHOW DEVICE PRJ$ROOT249" shows the status of the disk behind
    PRJ$ROOT249.
  - SHOW DEVICE can be qualified with /FULL to get a full listing of the
    device's properties. For a disk this includes not just the free space but
    also the total space (size of the disk) and where it's mounted.
  - The various fields are relatively self-explanatory. Two are of
    particular importance:
    - Error Count should be 0, but is often 11 for Elite 9's (9Gig disks)
      immediately after a reboot. This count covers only the errors seen on
      the local system, so you'll often have to go to the disk's host machine
      to really find out whether errors are occurring. It all depends on
      which machine saw the error.
    - Device Status:
      - It should be "Mounted" for most disks.
      - Some disks will show "ONLINE" on all but one machine. These are
        usually disks that contain only a paging file for the host machine
        and nothing else. ONLINE means that the disk could be mounted on the
        cluster, but isn't. There are two other times when this will happen:
        right after a reboot, since it takes a while for MAD_MOUNTER to mount
        all the disks; and when the disk has died and been dismounted from
        the cluster. In the latter case it will remain "ONLINE", even after
        having been physically removed from the machine, on all machines that
        haven't rebooted.
      - Mount Verify or Mount Verify Timeout: these are bad states to be in.
        Mount Verify means that the disk has stopped communicating and the
        system is trying to reestablish communications. All processes trying
        to access this disk will hang until the system times out (Mount
        Verify Timeout), currently after about 17 hours. Once it times out,
        a fatal error is returned to every process accessing that disk and
        the process or command will die. This is by far the most common
        reason for processes to hang. If it's a paging disk that goes into
        Mount Verify, then a lot of processes will hang, including, perhaps,
        the basic operating system processes on that machine. This might
        cause the machine to hang or even crash.
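The SHOW DEVICE forms described above could be tried like this. The names node_name and PRJ$ROOT249 come from the text and stand in for whatever exists at your site.

```dcl
$! Brief listing of every device whose name starts with DK
$ SHOW DEVICE DK
$! The same, restricted to disks served by one node (placeholder name)
$ SHOW DEVICE node_name$DK
$! Full listing for one disk, named through a logical
$ SHOW DEVICE/FULL PRJ$ROOT249
```

In the brief listing, the device status and error count columns are the ones to scan first.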
- SHOW USER[/NODE=...][/FULL] [user_name] shows users on the machine or
  cluster. With no user specified, it shows all users. /FULL gives
  information on each process; without /FULL it shows only the users and the
  number of processes of each type they own.
- SHOW MEMORY gives a lot of information. The only information useful to most
  users is the Paging File Usage at the bottom of the display, and perhaps
  the Main Memory. The useful paging file information is:
  - the disk containing the PAGEFILE.SYS file,
  - its Total size, and
  - its Free space. (Reservable will often be negative, so it isn't really
    useful.) The Free space is crucial. If it goes to zero, your system will
    at best become extremely slow and processes will begin to hang. At worst
    your entire system will hang. The solution for low free space is to
    reduce the number or size of the jobs using it, or to increase the page
    file space at the expense of usable disk space.
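If only the paging file figures are wanted, something like the following may save some reading. The /FILES qualifier is not mentioned in the text above; it is offered here on the assumption that your VMS version supports it.

```dcl
$! Full display: physical memory, slots, pools and paging files
$ SHOW MEMORY
$! Only the paging and swap file section (assumes /FILES is available)
$ SHOW MEMORY/FILES
```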
Examples
One or a Few Windows Hung
If one or a few of your windows are hung, the first thing you should do is
hit Ctrl-Q (control key + q). You may have that window paused. You'd be
surprised how much time is wasted determining that someone hit Ctrl-S or the
Pause key without realizing it. If that doesn't work:
- Determine whether anything works on the machine you're sitting in front
  of, or on the machine hosting your X-Terminal session. To do this, make
  sure that you can switch window context. If you can, your window manager is
  OK and the machine is actually running.
  - If you can switch context, try typing some command in one of your
    windows. SHOW SYSTEM is good; do not ask for a directory. Even SHOW
    SYSTEM could hang, but if it does, that will be because the SYSTEM disk
    has taken a vacation, and the system people will find out about that real
    quick.
  - If you can't switch context, or no commands work in any window, go to
    the example for All Windows Hung.
- If some windows work and the hung one(s) don't, try to determine what was
  going on in the hung window. Most likely you were trying to access some
  disk that has gone out to lunch.
  - If you're lucky, a SHOW DEV DK will show some disk that you might have
    been using in MOUNT VERIFY (note: MOUNT VERIFY TIMEOUT won't hang
    windows). Since a disk will only go into MOUNT VERIFY on a particular
    node when it is accessed from that node, you might have to log in to the
    same node that the hung window is on.
  - If that doesn't work (it doesn't always), you may have to try accessing
    the specific disks you suspect until one hangs; DIR ... will do it.
    NOTE: either use a window that you're not likely to need for a while, pop
    a new one just for this purpose, or SET HOST from some other window. The
    latter will allow you to disconnect using a double Ctrl-Y even if you
    hang.
  - If you hang trying to log in, there are three possibilities: the SYSTEM
    or CLUSTER$COMMON disk is gone, your USR$ROOT disk is gone, or some disk
    (D0Library or some PRJ or TMP disk) that you access in your LOGIN.COM is
    in trouble. In the first case there's nothing you can do except call the
    help desk (x2345) with the information. Tell them which machine you're
    on. There are multiple SYSTEM disks, so only some of the machines, the
    ones that boot from that disk, will hang.
    In the latter cases, you can SET HOST to the machine, then qualify your
    username with /NOCOMMAND. This skips running your LOGIN.COM. If this
    lets you in, you won't be able to use any customizations you make in your
    LOGIN.COM, but all of the system commands will work and you can try to
    find out what's wrong. A dead USR$ROOT will be obvious: DIR immediately
    after logging in will hang. If it's some disk accessed in your LOGIN.COM,
    you'll probably have to do a binary search by commenting out parts of
    your LOGIN.COM and running the rest. To edit your LOGIN.COM, you'll
    either need to know one of the standard editors (D0's EVE isn't one) or
    set up D0Library by hand (@D0$DISK:[D0LIBRARAY]D0LOCAL.COM), assuming
    it's not D0Library's disk that's gone west.
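The /NOCOMMAND login described above might go roughly as follows. Here node_name and your_username are placeholders, and SYS$LOGIN is the standard logical name for your login directory.

```dcl
$! From a window that still works, connect to the suspect machine
$ SET HOST node_name
$! At the Username: prompt, append /NOCOMMAND so LOGIN.COM is skipped:
$!     Username: your_username/NOCOMMAND
$! Once logged in, check the disk holding your login directory
$ DIRECTORY SYS$LOGIN
$! Then look for disks in Mount Verify
$ SHOW DEVICE DK
```

If the DIRECTORY command hangs, a double Ctrl-Y should drop the SET HOST connection, as noted above.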
All Windows Hung
This is actually a fairly simple case. It usually is when something fails hard.
In this case you need to determine only a few things: is the cluster up, is your
boot node up and is your machine up?
- If anyone on any machine in your cluster can work, the cluster is up. If
  not, call x2345, though they probably know already.
- On the Alpha Cluster and D0SFT there is only one boot node each (D0TNG and
  D0SFB), so if the cluster is up, the boot node is too. On FNALD0, from any
  working window, do SHOW SYSTEM/NODE=boot_node, where boot_node can be
  D0GS01 through D0GS05. If any of them returns an error, call x2345 with
  the information.
- That leaves your machine. Do SHOW SYSTEM/NODE=your_node.
  - If it returns an error, you'll have to reboot your machine using the
    reset button or power switch.
  - If it returns with a display, your machine is running, at least to some
    level.
    - Do a SET HOST to your machine. If you're able to log in, then your
      window manager or session manager has hung. To start a new session
      manager, SET HOST to your machine using the username DECW_RESTART (no
      password is required) and answer the questions. Make sure you restart
      the session manager on the right machine. This will blow away all of
      your windows (sorry, it can't be helped) and give you the normal LOGIN
      box.
    - If you hang at login, or get an error, your machine is hung. This
      could be due to several causes. The worst one is that the system ran
      out of paging file space (see SHOW MEMORY). Unfortunately, if the
      system has really hung, it may be impossible to determine whether this
      is the case. Another possibility is that the disks, in particular the
      paging disk, have gone into MOUNT VERIFY. Check this with SHOW DEVICE.
      In either case, a reboot is almost certainly necessary. Please send
      MAIL to 'clustername_SYSTEM_MGR' informing us what you've done and why,
      so we can watch your system for signs of further problems.
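Run from any window that still responds, the checks above amount to something like the following sketch. D0GS01 is one of the FNALD0 boot nodes named above, and your_node is a placeholder for the machine on your desk.

```dcl
$! Is a boot node answering? Repeat for D0GS02 through D0GS05
$ SHOW SYSTEM/NODE=D0GS01
$! Is your own machine answering?
$ SHOW SYSTEM/NODE=your_node
$! If it is, try to log in to it
$ SET HOST your_node
```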
Alan M. Jonckheere
jonckheere@fnal.gov
last modified 5/30/96