Simple Unix System Diagnostics

This document is meant to help an unskilled, non-privileged user discover various facts about the UNIX system they are using: its operation, resources, and status. In particular, it should let a user find out enough about their machine to discover why they are having problems or, at least, to give a system manager enough information to figure out why. In either case, figuring out what is wrong, if anything, goes a long way toward fixing the problem.

To find out what is going on on your system, the most common commands are ps (process status) and sar (system activity reporter).

To see which file systems are mounted on your system and how much space is free on each, use df (disk free). To see how much disk space particular files or directories occupy, use du (disk usage).

To kill a process, use kill. Non-privileged users can only kill their own processes.


ps

Useful options:

-e            show all processes    
-l            show long list
-f            show full list
-u username   show processes owned by a specific user

For example:

d0sgi6[70]% ps -ef
     UID   PID  PPID  C    STIME TTY      TIME COMD
    root     0     0  0   Nov 17 ?        0:00 sched
    root     1     0  0   Nov 17 ?        2:23 /etc/init 
    root     2     0  0   Nov 17 ?        0:00 vhand
    root     3     0  0   Nov 17 ?        5:27 bdflush
    root     4     0  0   Nov 17 ?        2:25 vfs_sync
    root     5     0  0   Nov 17 ?        0:00 pdflush
dongzhao 19183     1  0   Dec 01 ?        0:03 xwsh -name winterm 

...processes omitted to save space...

Key things to check:

 TIME  - cumulative CPU time the process has used so far, as mm:ss

 STIME - when the process was started: hh:mm:ss, or the date if it
         was not started today

 UID   - who is running the job

 C     - the higher the number, the more CPU cycles the job has been
         getting. This is not a priority, just an indication of which
         jobs the computer is working on most often
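
To compare the CPU time of several processes it can help to turn the TIME column into plain seconds. The following is a minimal sketch; the function name time_to_seconds is made up for this example, and it assumes TIME is written as mm:ss (or hh:mm:ss on systems that print hours).

```shell
# Convert a ps TIME value such as 5:27 into seconds.
# Assumption: fields are separated by colons, most significant first.
time_to_seconds() {
    echo "$1" | awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        print s
    }'
}

time_to_seconds 5:27    # bdflush in the listing above: prints 327
```

So bdflush has used 327 seconds of CPU time since it was started on Nov 17.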

d0sgi6[77]% ps -el
 F S   UID   PID  PPID  C PRI NI  P    SZ:RSS      WCHAN TTY      TIME COMD
39 S     0     0     0  0  39 RT  *     0:0     801632c0 ?        0:00 sched
30 S     0     1     0  0  39 20  *    69:41    801632f0 ?        2:24 init
39 S     0     2     0  0  39 RT  *     0:0     80163180 ?        0:00 vhand
30 S  6354  8927  8926  0  26 20  *   372:67    8025c440 pts/1    0:00 telnet
30 S     0  8151   205  0  26 20  *   294:47    8025c3c0 pts/0    0:01 rlogind
30 S  6354 13670 13669  1  39 20  *   538:222   801632f0 pts/3    0:07 tcsh
30 R  6354  9743 13670 10  65 20  0   320:57             pts/3    0:00 ps
30 S  6354 19168     1  0  26 20  *  1494:664   8025c250 ?        1:23 4Dwm
30 S  6354 20166     1  0  26 10  *   763:301   8025c2b0 ?        0:35 xwsh

...processes omitted to save space...

Key fields:

NI  - nice value of the job, which sets its priority:
      20 is the standard value, 0 is the highest priority, 39 is the lowest

SZ  - total amount of memory the program uses, in pages of 4096 bytes

RSS - amount of the program's memory actually in RAM, in pages of 4096 bytes

S   - state of the process at the time the ps command was executed:
      R is running, S is sleeping

To figure out how much memory a program is using, multiply RSS by 4096.

     In the list above 4Dwm is using the most memory: 664 * 4096 bytes = 2.7 MB
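
The same search can be done by a small filter instead of by eye. This is only a sketch: the name top_rss is invented here, and it assumes the IRIX layout shown above, where size and resident size share one SZ:RSS column, the command name is the last field, and a page is 4096 bytes.

```shell
# Read "ps -el" output on standard input and print the command with
# the largest RSS, together with its resident memory in MB.
# Optional argument: page size in bytes (assumed 4096 by default).
top_rss() {
    awk -v page="${1:-4096}" '
        /SZ:RSS/ {                       # header line: locate the column
            for (i = 1; i <= NF; i++) if ($i == "SZ:RSS") col = i
            next
        }
        col {
            split($col, m, ":")          # m[1] is SZ, m[2] is RSS
            if (m[2] + 0 > max) { max = m[2] + 0; name = $NF }
        }
        END { if (name != "") printf "%s %.1f\n", name, max * page / 1000000 }
    '
}
```

Typical use would be ps -el | top_rss, which for the listing above reports 4Dwm with 2.7 MB.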

sar

Useful options:

-u     CPU usage report
-r     Memory usage report
-d     device report

For example:

sar -u 5 5 (system CPU usage: 5 samples taken at 5-second intervals)

d0sgi6[80]% sar -u 5 5

IRIX d0sgi6 5.3 11091810 IP12    12/03/97

13:39:20  %usr  %sys %intr  %wio %idle %sbrk  %wfs %wswp %wphy %wgsw %wfif
13:39:25     2     3     3     0    92     0     0     0     0     0     0
13:39:30     4     3     2     0    91     0     0     0     0     0   100
13:39:35     1     2     2     1    94     0    50     0     0     0    50
13:39:40     5     3     6     0    86     0     0     0     0     0   100
13:39:45     6     3     2     5    84     0     0     0     0     0   100


13:39:45  %usr  %sys %intr  %wio %idle %sbrk  %wfs %wswp %wphy %wgsw %wfif
Average      4     3     3     1    89     0     7     0     0     0    93

This shows what percentage of the time the CPU spent in user, system,
   interrupt, wait I/O, and idle modes.

If the idle percentage is high, there is plenty of CPU time to run your program.
Wait I/O (%wio) means that processes are waiting for data, most likely
   to be retrieved from disk.
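
When many samples are taken, the idle figure can be averaged by a filter rather than read off line by line. A minimal sketch, assuming the layout shown above (a header line naming the %idle column and sample lines that begin with an hh:mm:ss time stamp); the name avg_idle is made up for this example.

```shell
# Read "sar -u" output on standard input and print the mean of the
# %idle column over the individual samples.
avg_idle() {
    awk '
        /%idle/ {                        # header line: locate the column
            for (i = 1; i <= NF; i++) if ($i == "%idle") col = i
            next
        }
        col && /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n }
    '
}
```

Typical use would be sar -u 5 5 | avg_idle. For the five samples above it gives 89.4, which agrees with the 89 that sar prints on its own Average line.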

sar -r 5 2  (free memory and swap space: 2 samples taken at 5-second intervals)

d0sgi6[82]% sar -r 5 2

IRIX d0sgi6 5.3 11091810 IP12    12/03/97

13:42:37 freemem freeswp
13:42:42    5801  210000
13:42:47    5801  210000


13:42:47 freemem freeswp
Average     5801  210000

 freemem is the number of free memory pages, at 4096 bytes per page
 freeswp is the amount of free swap space in 512-byte disk blocks

        4096 * 5801  =  23.8 MB of free RAM
        512 * 210000 =  107  MB of free swap space
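
The two multiplications can be left to a filter that reads the Average line. This is a sketch under the assumptions stated above (4096-byte pages, 512-byte swap blocks, and the two-column freemem/freeswp layout); the name free_mb is invented for this example.

```shell
# Read "sar -r" output on standard input and print the average free
# RAM and free swap in MB.
# Optional arguments: page size and swap block size in bytes.
free_mb() {
    awk -v page="${1:-4096}" -v block="${2:-512}" '
        $1 == "Average" {
            printf "RAM %.1f MB, swap %.1f MB\n",
                   $2 * page / 1000000, $3 * block / 1000000
        }
    '
}
```

Typical use would be sar -r 5 2 | free_mb. For the output above it prints RAM 23.8 MB, swap 107.5 MB.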


df

df reports the total, used, and available disk blocks in each file system (one disk block equals 512 bytes).

d0sgi6[83]% df
Filesystem                 Type  blocks     use   avail %use  Mounted on
/dev/root                   efs   37615   20519   17096  55%  /
/dev/usr                    efs  853020  517401  335619  61%  /usr
/dev/dsk/dks0d2s7           efs 1975100 1695823  279277  86%  /exports/data0
/dev/dsk/dks0d1s2           efs  879625  740154  139471  84%  /exports/usr/people
d0chb:/d0dist               nfs 4426512 3956200  470312  89%  /d0dist
d0cha:/d0library            nfs 4319768 2972823 1346945  69%  /d0library
d0sgi0:/usr/local           nfs  651875  146295  505580  22%  /usr/local
d0sgi0:/usr/products        nfs 2406580 2268638  137942  94%  /usr/products
d0sgi0:/exports/usr/peo     nfs 7603592 7230643  372949  95%  /tmp_mnt/d0sgi0/usr0
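
A full file system is a common cause of trouble, so it can be useful to pick out the ones that are nearly full. A minimal sketch, assuming that each df line carries a percentage field such as 94% and ends with the mount point; the name full_filesystems and the default limit of 90 are choices made for this example.

```shell
# Read df output on standard input and print the mount point and
# usage of every file system that is more than LIMIT percent full.
# Optional argument: LIMIT (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+%$/) {      # the %use field, e.g. 94%
                    pct = $i
                    sub(/%/, "", pct)
                    if (pct + 0 > limit + 0) print $NF, pct "%"
                }
        }
    '
}
```

Typical use would be df | full_filesystems 90. For the listing above it reports /usr/products at 94% and /tmp_mnt/d0sgi0/usr0 at 95%.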

du

du reports the number of blocks used by the specified files or directories. If no names are given, the current directory is used.

Useful options:

-s   reports only the grand total for each of the specified names.
-k   reports all block counts in 1024-byte blocks instead of the
     default 512-byte blocks.

Example:

d0sgi6[92]% ls -l
total 5
drwxr-xr-x    2 berezhno D0           512 Sep  1  1994 berezhnoi/
drwxr-xr-x    2 bhat     D0           512 Nov  9  1993 bhat/
drwxr-xr-x    2 diesburg D0           512 Nov  9  1993 diesburg/
drwxr-xr-x    4 dongzhao D0           512 Nov 14 11:44 dongzhao/
drwxr-xr-x    6 root     sys          512 Jan 20  1994 reco_comp/

d0sgi6[93]% du -sk *
5545    berezhnoi
1       bhat
1       diesburg
11557   dongzhao
830735  reco_comp
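
With many entries it is easier to let sort put the largest first. The following is a sketch only; the function name largest_entries and its default of 5 lines are choices made for this example.

```shell
# List the largest entries of a directory, biggest first, in KB.
# Arguments: directory (default: current), number of lines (default 5).
largest_entries() {
    dir="${1:-.}"
    count="${2:-5}"
    # The subshell keeps the cd from affecting the calling shell.
    ( cd "$dir" && du -sk -- * 2>/dev/null | sort -rn | head -n "$count" )
}
```

In the directory above, largest_entries . 2 would list reco_comp first and dongzhao second.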


kill

kill sends a signal to the specified processes. By default it sends SIGTERM, which asks the process to terminate. For a sure kill, use the option -9 to send SIGKILL, which a process cannot catch or ignore.

If you suspect that one of your processes has gone wrong, first use ps -fu username to get a list of your processes, then use kill -9 PID to kill the offending process.

For example:

d0chb[45]% ps -fu piaf
     UID   PID  PPID  C    STIME TTY      TIME COMD
    piaf 26224 26206  1 14:03:52 ttyq13   0:03 -tcsh 
    piaf  7930 26224  8 14:20:30 ttyq13   0:00 ps -fu piaf 
    piaf 16056     1  0   Dec 02 ?        3:38 Netscape.real 

d0chb[46]% kill -9 16056

This kills the Netscape process.
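
Because SIGKILL gives a program no chance to clean up after itself, a gentler habit is to send the default SIGTERM first and fall back on -9 only if the process is still there after a few seconds. A minimal sketch of that sequence; the name stop_process and the 5-second default are assumptions made for this example, not part of kill itself.

```shell
# Ask a process to terminate, then force it if it does not go away.
# Arguments: PID, seconds to wait before using -9 (default 5).
stop_process() {
    pid="$1"
    grace="${2:-5}"
    kill "$pid" 2>/dev/null || return 1          # SIGTERM; fails if no such process
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # signal 0 only tests for existence
        sleep 1
        grace=$((grace - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    kill -9 "$pid" 2>/dev/null                   # SIGKILL as a last resort
}
```

With the listing above, stop_process 16056 would first ask Netscape to exit and use kill -9 only if it ignored the request. A process started from the same shell may still show up as existing until the shell has collected it, in which case the function simply waits out the full delay.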


Last modified by Dong Zhao on December 3 1997