D0 FARM Shift Instructions for October 2001

 

 
 
 
 
 

GETTING READY
WHAT TO DO
CHECKING ON PRESENT STATUS
STARTING NEW JOBS
SUBMITTING A JOB
CHECKING UP ON JOBS
CHECKING AND RESUBMITTING
WEEKLY SUMMARIES

EXAMPLES

GETTING READY
 

Shifters need to log in at least twice a day, at about 9 AM and after 5 PM.

Log into d0bbin as d0farm.

Run kinit as yourself again.

Type

source FARM_SETUP
This will put you in the run directory and set the $D0FARM_DIR environment variable to point
to the location of the scripts.


WHAT TO DO

Your first priority is to submit new jobs; after that, try to get the jobs that failed working again.

Jobs can fail to process all files for several reasons:

File was not in sam yet from online when job submitted (unavoidable)
File was not delivered to farm because of tape failure (common)
File was not delivered to farm because of staging or db failure (rare but a real bad thing when it happens)
File was delivered but reco or reco_analyze crashed (rare)
File was delivered but node crashed due to memory overload or disk failure (rare)
File was processed but not copied back to the I/O node (rare)
File has not been stored to tape yet due to tape failure (common)

CHECKING ON PRESENT STATUS
 

Look at the FBSWWW display to see what's going on in a global way:

http://www-isd.fnal.gov/cgi-bin/fbsng/fbswww/fbswww?action=graphs&farm=D0
 
 

You can see individual jobs by doing:

fbs lj                 to see a list of jobs
listjobs.py <jobno>    to see a single job's details
fbs status             to see all jobs in gory detail
fbs status <jobno>     to see details for a particular job

Type

checknodes

This will give a dump of the status of the farm. It takes a long time.
 

STARTING NEW JOBS
 
1. First find out from the previous shifter what version we are running.

On Sept 23 it was t01.56.00.

2. Check to see if new shift datasets have come in from the control room.

The shift captains will be creating datasets with keyword shift-runset.
 
 

sam list definitions --defname=shiftset%
To get a detailed list with status:
check_shift <version> %
The % can be replaced by a pattern like %-2%sep%, which selects the shiftsets from the 20th-29th of September.


This will take some time but will create  a list of the shiftsets and how many files have
been processed through that version.
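
As a rough illustration of how a % pattern picks out shiftsets, here is a minimal Python sketch. It assumes that % matches any run of characters (as in an SQL LIKE pattern) and that the pattern has to cover the whole dataset name. The helper names like_to_regex and select are made up for this example; they are not part of sam or check_shift.

```python
import re

def like_to_regex(pattern):
    """Turn a %-wildcard pattern into a compiled regular expression.

    Assumption: % stands for any run of characters, as in SQL LIKE.
    """
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts))

def select(names, pattern):
    """Return the names that match the whole pattern."""
    regex = like_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]

names = [
    "shiftset-day-02-sep-2001",
    "shiftset-owl-16-sep-2001",
    "shiftset-owl-23-sep-2001",
]
# Only the name with "-2" somewhere before "sep" is selected.
print(select(names, "shiftset%-2%sep%"))  # prints ['shiftset-owl-23-sep-2001']
```

Note that the "-2" in "-2001" comes after "sep", so the shiftsets from the 2nd and the 16th are not picked up by mistake.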

If there are no shift datasets, you should send a message to d0shifters asking about the status of the
shiftset for any missing shifts in the past 24 hours.


SUBMITTING A JOB
 

First you have to transform the shift dataset into one for the farms.  This just
adds a check to see if the files have already been processed through the same
version:
 
make_store <shiftset-name> <version>
This will make a new dataset definition farm-nb-<shiftset-name>-<version>
and put a listing in the ~d0farm/run/projects subdirectory.  We'll refer
to this as the <farm-dataset>

To see how many files there are in your project, do

size.csh <farm-dataset>

To submit a job:

runrecocert <farm-dataset> <number_of_nodes> <version> <queue>

runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ

Queue is normally TitaniumQ, which has 76 fast CPUs with 512 MB per processor.
The number of nodes should be less than about 40 and also less than the number of files.
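
The naming and node-count rules above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the cap of 39 is one reading of "less than about 40", and the file count is whatever size.csh reported for your dataset.

```python
MAX_NODES = 39  # assumption: our reading of "less than about 40"

def farm_dataset_name(shiftset_name, version):
    """Name that make_store gives the farm dataset, per the text above."""
    return "farm-nb-%s-%s" % (shiftset_name, version)

def pick_nodes(n_files, max_nodes=MAX_NODES):
    """Stay under the cap and under the number of files, but use at least 1."""
    return max(1, min(max_nodes, n_files - 1))

def runrecocert_command(shiftset_name, version, n_files, queue="TitaniumQ"):
    """Build the command line in the order shown above:
    <farm-dataset> <number_of_nodes> <version> <queue>."""
    dataset = farm_dataset_name(shiftset_name, version)
    nodes = pick_nodes(n_files)
    return "runrecocert %s %d %s %s" % (dataset, nodes, version, queue)

# Example: a 47-file shiftset processed with t01.56.00.
print(runrecocert_command("shiftset-owl-16-sep-2001", "t01.56.00", 47))
```

This only builds the command string; it does not submit anything.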

To check on the job:

listjobs.py <jobno>
 

The logs will go into

~d0farm/run/<date> (script logs)

or into

/d0/stripeX/samtest/<jobno>/logs and /d0/stripeX/samtest/<jobno>/badinput/logs (framework logs)
 

CHECKING UP ON JOBS

 

What's running


To see which jobs are running

listprojects

This will summarize job numbers, the project names and the sections which are working.
 

To see if disks are filling up


df

Disks should be less than 90% full. /stripe4 and /stripe9 are exceptions: they are used by other groups.
/stripe8 is used for test output.
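
If you would rather not read the df listing by eye, a check along these lines could do it. This is a minimal sketch, not an existing farm script: it assumes the column layout of the df listing in the EXAMPLES section (device, type, size, used, available, percent used, mount point), and the function name full_disks is made up.

```python
EXEMPT = {"/d0/stripe4", "/d0/stripe9"}  # used by other groups, per the text above
LIMIT = 90                               # disks should be less than 90% full

def full_disks(df_lines, limit=LIMIT, exempt=EXEMPT):
    """Return (mount point, percent used) for stripe disks at or over the limit."""
    flagged = []
    for line in df_lines:
        fields = line.split()
        # Skip headers and anything that does not look like a df data line.
        if len(fields) < 7 or not fields[5].isdigit():
            continue
        percent, mount = int(fields[5]), fields[6]
        if mount.startswith("/d0/stripe") and mount not in exempt and percent >= limit:
            flagged.append((mount, percent))
    return flagged

sample = [
    "/dev/xlv/stripe9        xfs 142161776 133177520  8984256  94  /d0/stripe9",
    "/dev/xlv/stripe5        xfs 142161776 65462904 76698872  47  /d0/stripe5",
    "/dev/xlv/stripe4        xfs 142162560 117335864 24826696  83  /d0/stripe4",
]
print(full_disks(sample))  # prints [] because stripe9 is exempt
```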
 

To see if file stores are working


ps -ef | grep -c storeafile3.py

tells you how many file stores are queued up.
 

CHECKING AND RESUBMITTING


This will be automated soon but for now it is a royal, extreme pain.

Every couple of days, you will want to resubmit a bunch of old jobs.

You can find the status of existing jobs by looking at the check_shift dump, then going to the

~/run/jobdumps directory

and running the script

check_dump

on the dump files, which are created at the end of each job.
 
 

WEEKLY SUMMARIES


There is a script in

~schellma/v03-01-00/farm_machinery/samutils on d0mino called

production_summary.csh

production_summary.csh 09/10/2001 09/16/2001 t01.56.00

will give the t01.56.00 summary for the dates between the 10th and 16th of September, inclusive.

You need to post this in

~WWW/docs/computing/production/weekly_reports
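
The two date arguments in the example above happen to be a Monday and the following Sunday. Assuming the weekly summary always runs Monday through Sunday (an inference from that one example, not a stated rule), the command line for any given week could be built like this. The function names are hypothetical.

```python
from datetime import date, timedelta

def week_range(day):
    """Return the Monday and Sunday of the week containing `day`,
    formatted MM/DD/YYYY as in the example above."""
    monday = day - timedelta(days=day.weekday())  # weekday() is 0 on Monday
    sunday = monday + timedelta(days=6)
    return monday.strftime("%m/%d/%Y"), sunday.strftime("%m/%d/%Y")

def summary_command(day, version):
    """Build the production_summary.csh command line for that week."""
    start, end = week_range(day)
    return "production_summary.csh %s %s %s" % (start, end, version)

print(summary_command(date(2001, 9, 12), "t01.56.00"))
# prints production_summary.csh 09/10/2001 09/16/2001 t01.56.00
```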
 
 

EXAMPLES

 

 
 
 
 
 
 

<d0bbin> source FARM_SETUP
 
 

<d0bbin> fbs lj
JobID  State      User     Sections
------ ---------- -------- ----------------------
5704   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5705   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5707   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5712   running    d0farm   END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*

WORKER_JOB is the LINUX CPU
START_SAM is the master control job
_JOB_CONTROL is the monitor process

* means running,
d means dead,
w means waiting for other job section,
r means waiting for resources
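
For illustration, the Sections column of an fbs lj line can be decoded with the legend above. This is a minimal Python sketch under the assumption that every line has the four columns shown (JobID, State, User, Sections); parse_lj_line is a made-up helper, not part of fbs. Entries cut off by the display (such as "_...") are simply left out.

```python
STATE_CODES = {
    "*": "running",
    "d": "dead",
    "w": "waiting for other job section",
    "r": "waiting for resources",
}

def parse_lj_line(line):
    """Split one 'fbs lj' line into job id, state, user and section states."""
    fields = line.split()
    job_id, state, user = fields[0], fields[1], fields[2]
    sections = {}
    for token in fields[3:]:
        name, sep, code = token.rpartition(":")
        if not sep:
            continue  # truncated entry such as "_...": leave it out
        sections[name] = STATE_CODES.get(code, "unknown (%s)" % code)
    return job_id, state, user, sections

line = "5712   running    d0farm   END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*"
job_id, state, user, sections = parse_lj_line(line)
for name, meaning in sections.items():
    print(job_id, name, meaning)
```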

<d0bbin> checknodes
Summary of activity for user d0farm
Batch status:    37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data
File stores in progress:         0
ls: No match.
Files which need to be stored:   queued:  0      to be queued:  1

Here there are 37 jobs running on worker nodes: 33 are doing D0reco_x, 3 are waiting for data,
and one is doing something else.
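
The "one is doing something else" figure is just the total minus the listed categories. A minimal Python sketch of that arithmetic, assuming the Batch status line always has the comma-separated "count label" form shown above (the function names are made up for this example):

```python
def parse_batch_status(line):
    """Turn a 'Batch status:' line into a {label: count} dictionary."""
    _, _, rest = line.partition(":")
    counts = {}
    for piece in rest.split(","):
        number, _, label = piece.strip().partition(" ")
        counts[label.strip()] = int(number)
    return counts

def unaccounted(counts):
    """Jobs that are not in any of the listed categories."""
    listed = sum(n for label, n in counts.items() if label != "jobs")
    return counts["jobs"] - listed

line = "Batch status:    37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data"
status = parse_batch_status(line)
print(unaccounted(status))  # prints 1
```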
 
 

<d0bbin> listjobs.py 5712
----------------------- 5712 --------------------
Status of Batch job:     5712
Analysis Project:        farm.t01.54.00.5712
Project Definition:      farm-shiftset-eve-01aug04-raw-t01.54.00         Files to go:
Output Buffer:            /d0/stripe7/samtest/5712
Batch Queue:             TitaniumQ
Batch status:    20 jobs, 17 D0reco_x, 0 RecoAnalyze_x, 0 waiting for data, 3 finished
File stores in progress:         0
Sam Project Status:      For files: 1..32 errors: 11 in progress: 20 finished 0
Sam Consumer ID:         28256
Of 32 project files, umer saw 11 (20 good + 11 bad).
Umer failed to process 0 good files and missed 0 files
Number of files which crashed: 1
Number of files which are ok: 1
Machines:d0bbin, d0bbin, fnd080, fnd078, fnd079, fnd075, fnd076, fnd077, fnd071, fnd072, fnd073, fnd081, fnd083, fnd082, fnd085, fnd084, fnd087, fnd088, fnd056

5712 is the job number
Output buffer is where the data/logs will go
Batch Status tells you what the workers are doing
Sam Project status:
error = can't get the file
in progress = file has been cached, not finished yet
finished = released by d0 code after processing
In this job, two files are actually done: one crashed and the other was OK.

<d0bbin> fbs status 5713
  Section ID: 5713._JOB_CONTROL   Name: _JOB_CONTROL

  Process resources: IO:1
  Section resources: StartSections:1
  Start Time: Fri Aug 10 09:04:22 2001    End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/jobControl.c*
  State:      running                     Depend:
  ProcType:   StartSAM                    Queue: StartQueue
  NumProc:    1                           Nice: 0

  Process #1  (5713._JOB_CONTROL.1)  on d0bbin Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  616092      0m01s  0m52s /bin/tcsh -f /home/d0farm/v00-03-01/farm_mac*
  616007      0m51s  0m51s  python -u /home/d0farm/v00-03-01/farm_machi*
  ------------------------------------------------------------------------

  Section ID: 5713.START_SAM   Name: START_SAM

  Process resources: IO:1
  Section resources: StartSections:1
  Start Time: Fri Aug 10 09:04:22 2001    End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startProject*
  State:      running                     Depend: started(_JOB_CONTROL)
  ProcType:   StartSAM                    Queue: StartQueue
  NumProc:    1                           Nice: 0

  Process #1  (5713.START_SAM.1)    on d0bbin Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------

  616075      0m01s  0m54s /bin/tcsh -
  612846      0m53s  0m53s  python -u /home/d0farm/v00-03-01/farm_machi*
  632733          0      0   sleep 600
  615346          0      0  /bin/tcsh -f /home/d0farm/v00-03-01/farm_ma*
  616469          0      0   sleep 28800
  ------------------------------------------------------------------------
  Section ID: 5713.WORKER_JOB   Name: WORKER_JOB

  Process resources: Titanium:1 cpu:100
  Section resources:
  Start Time: Fri Aug 10 09:04:23 2001    End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startWorker.*
  State:      running                     Depend: started(START_SAM)
  ProcType:   Worker_15                   Queue: TitaniumQ
  NumProc:    10                          Nice: 0

  Process #1  (5713.WORKER_JOB.1)   on fnd080 Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  15889           0 51m58s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
  16075       0m46s 51m58s  python -u /home/d0farm/v00-03-01/farm_machi*
  17984      51m12s 51m12s   ./D0reco_x -rcp runD0reco_data.rcp -out re*
  16702           0      0   (python)
  16035           0      0  tcsh -f /home/d0farm/v00-03-01/farm_machine*
  16037           0      0   sleep 28800

  Process #2  (5713.WORKER_JOB.2)   on fnd078 Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  9652            0 51m41s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
  9825        0m47s 51m41s  python -u /home/d0farm/v00-03-01/farm_machi*
  10465           0      0   (python)
  11720      50m54s 50m54s   ./D0reco_x -rcp runD0reco_data.rcp -out re*
  9798            0      0  tcsh -f /home/d0farm/v00-03-01/farm_machine*
  9800            0      0   sleep 28800

...........

..........

  Process #10 (5713.WORKER_JOB.10)  on fnd073 Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  10919           0 50m29s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
  11065           0      0  tcsh -f /home/d0farm/v00-03-01/farm_machine*
  11067           0      0   sleep 28800
  11105       0m45s 50m29s  python -u /home/d0farm/v00-03-01/farm_machi*
  13017      49m44s 49m44s   ./D0reco_x -rcp runD0reco_data.rcp -out re*
  11759           0      0   (python)
  ------------------------------------------------------------------------

  Section ID: 5713.END   Name: END

  Process resources: IO:1
  Section resources:
  Start Time: Not Started                 End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/stopProject.*
  State:      waiting                     Depend: ended(WORKER_JOB)
  ProcType:   EndSAM                      Queue: EndQueue
  NumProc:    1                           Nice: 0
  ------------------------------------------------------------------------
 

<d0bbin> sam list definitions --defname=shiftset%sep%
 

 Dataset Def Name                                   Create Date         User Name  Work Group
shiftset-eve-01-sep-2001                           09/01/2001 17:12:54 d0run      dzero
shiftset-owl-02-sep-2001                           09/02/2001 06:51:39 d0run      dzero
shiftset-day-02-sep-2001                           09/02/2001 16:59:15 d0run      dzero
shiftset-owl-04-sep-2001                           09/04/2001 08:54:56 d0run      dzero
shiftset-day-04-sep-2001                           09/04/2001 17:19:32 d0run      dzero
shiftset-eve-04-sep-2001                           09/05/2001 01:29:10 d0run      dzero
shiftset-day-05-sep-2001                           09/05/2001 16:46:20 d0run      dzero
shiftset-eve-05-sep-2001                           09/06/2001 00:58:39 d0run      dzero
shiftset-day-06-sep-2001                           09/06/2001 17:00:28 d0run      dzero
shiftset-eve-06-sep-2001                           09/07/2001 01:06:28 d0run      dzero
shiftset-day-08-sep-2001                           09/08/2001 16:45:14 d0run      dzero
shiftset-owl-09-sep-2001                           09/09/2001 08:53:29 d0run      dzero
shiftset-day-09-sep-2001                           09/09/2001 17:09:20 d0run      dzero
shiftset-eve-09-sep-2001                           09/10/2001 01:09:47 d0run      dzero
shiftset-owl-10-sep-2001                           09/10/2001 08:30:30 d0run      dzero
shiftset-eve-08-sep-2001                           09/10/2001 18:56:06 d0run      dzero
shiftset-day-14-sep-2001                           09/14/2001 17:33:36 d0run      dzero
shiftset-eve-14-sep-2001                           09/15/2001 01:11:21 d0run      dzero
shiftset-owl-15-sep-2001                           09/15/2001 08:34:35 d0run      dzero
shiftset-owl-15-sep-2001b                          09/15/2001 08:35:21 d0run      dzero
shiftset-eve-15-sep-2001                           09/16/2001 01:30:44 d0run      dzero
shiftset-owl-16-sep-2001                           09/16/2001 08:54:47 d0run      dzero
shiftset-day-16-sep-2001                           09/16/2001 16:49:43 d0run      dzero
 
 

 
 
 
 
 

<d0bbin> make_store shiftset-owl-09-aug-2001-raw t01.54.00
Files:
   halo_0000127640_001.raw
   halo_0000127640_002.raw
   halo_0000127640_003.raw
   halo_0000127640_004.raw
   halo_0000127640_005.raw
   halo_0000127640_006.raw
   halo_0000127640_007.raw
   halo_0000127640_008.raw
   halo_0000127640_009.raw
   halo_0000127640_010.raw

File Count:  0
Average File Size:  255781

Pause for 15 seconds, can CTL-C if aren't interested

Files:
   halo_0000127640_001.raw
   halo_0000127640_002.raw
   halo_0000127640_003.raw
   halo_0000127640_004.raw
   halo_0000127640_005.raw
   halo_0000127640_006.raw
   halo_0000127640_007.raw
   halo_0000127640_008.raw
   halo_0000127640_009.raw
   halo_0000127640_010.raw

File Count:  0
Average File Size:  255781

Dataset definition created with Id:  6603

Data set is farm-nb-<shiftset-name>-<version>

<d0bbin> runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ
Number of nodes choosen:  10
Recostruction version applied  t01.54.00
we are running against:  prd
disk  /d0/stripe7
recon&root/recon/root:  recon_root
number of events for reconstruction:  0
Date: Aug10
Cannot create directory "Aug10": File exists
Farm Job 5714 has been submitted...
 
 

listprojects
<d0bbin> listprojects
7080 farm-nb-shiftset-day-06-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*
7084 farm-nb-shiftset-owl-04-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*
7097 farm-nb-shiftset-eve-15-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7381 farm-nb-shiftset-owl-10-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:d _...
7474 farm-nb-shiftset-day-05-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7475 farm-nb-shiftset-eve-09-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7476 farm-nb-shiftset-day-08-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7478 farm-recocert-p09.08.00-sim-p10.04.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:* _...
7479 farm-nb-recocert-129194-raw-p10.04.00 running END:d START_SAM:* WORKER_JOB1:d WORKER_JOB2:d _...
7487 farm-nb-shiftset-owl-16-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:d WORKER_JOB2:* _...
7488 farm-nb-shiftset-owl-15-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7520 farm-nb-shiftset-owl-23-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:d _...
7560 recocert-129194-raw running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
This shows the status of the various sections of the job and which project it is running.

'*' means running
'r' means 'ready'. This is bad, as it means the section is hung up by some other queue
'd' means 'dead'. This can happen for WORKER_JOB2 sections if you run a null section
'w' means waiting. This is normal for END sections, which wait for the WORKER sections
'f' means failed. This can also happen for WORKER_JOB2 sections
'c' means canceled
 

df


/dev/root               xfs 13673424  7113920  6559504  53  /
/dev/xlv/stripe9        xfs 142161776 133177520  8984256  94  /d0/stripe9
/dev/xlv/stripe5        xfs 142161776 65462904 76698872  47  /d0/stripe5
/dev/xlv/stripe8        xfs 142161776 47465048 94696728  34  /d0/stripe8
/dev/xlv/stripe7        xfs 142161776 28237224 113924552  20  /d0/stripe7
/dev/xlv/stripe6        xfs 142161776 12048416 130113360   9  /d0/stripe6
/dev/xlv/d0farm         xfs 35534776 30013544  5521232  85  /export/d0farm
/dev/xlv/stripe4        xfs 142162560 117335864 24826696  83  /d0/stripe4
/dev/xlv/stripe3        xfs 142161280 97439040 44722240  69  /d0/stripe3
/dev/xlv/crash          xfs 35534776  4518384 31016392  13  /var/adm/crash
/dev/xlv/stripe2        xfs 142162560 53248816 88913744  38  /d0/stripe2
/dev/xlv/stripe1        xfs 142161280 46144600 96016680  33  /d0/stripe1
/dev/dsk/dks2d8s1       xfs  6284240  3480872  2803368  56  /export/products
/dev/dsk/dks2d8s4       xfs  6259192     8600  6250592   1  /export/usr/local
/dev/dsk/dks2d8s0       xfs 11480952  4677144  6803808  41  /export/home
 

Disks stripe1, 2, 3, 5, 6 and 7 should not be near 90%. The worker jobs should stall if a disk hits 90%.
 

check_dump


check_dump farm-nb-shiftset-eve-16-sep-2001-t01.56.00_7299.dump

File: farm-nb-shiftset-eve-16-sep-2001-t01.56.00_7299.dump
Files in Project  42
Good files (32
Undelivered files:  10
Reco crashed: 2
Have  12  you probably can't process
Undelivered include 0/0 NOACCESS/NOTALLOWED files and 0 which timed out

As the undelivered files were not NOACCESS, I'd resubmit this one.

check_dump farm-nb-shiftset-owl-18-sep-2001-t01.56.00_7119.dump

File: farm-nb-shiftset-owl-18-sep-2001-t01.56.00_7119.dump
Files in Project  47
Good files (47
Undelivered files:  0
Reco crashed: 0
Have  0  you probably can't process
Undelivered include 0/0 NOACCESS/NOTALLOWED files and 0 which timed out

This one is DONE!!

check_dump farm-nb-shiftset-owl-22-sep-2001-t01.56.00_7490.dump

File: farm-nb-shiftset-owl-22-sep-2001-t01.56.00_7490.dump
Files in Project  123
Good files (0
Undelivered files:  123
Reco crashed: 0
Have  123  you probably can't process
Undelivered include 123/0 NOACCESS/NOTALLOWED files and 0 which timed out

Your worst nightmare - all of the files, every single last one, are on a bad tape. Bummer.

You may be able to rerun in a couple of days.
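
The three cases above suggest a simple rule of thumb, sketched here in Python. It is only one reading of those examples, not the logic of check_dump itself: the job is done if nothing is undelivered or crashed, worth resubmitting if some undelivered files are not NOACCESS/NOTALLOWED, and otherwise best left for a couple of days. The function names and the "hold" label are made up.

```python
import re

PATTERNS = {
    "total": r"Files in Project\s+(\d+)",
    "undelivered": r"Undelivered files:\s*(\d+)",
    "crashed": r"Reco crashed:\s*(\d+)",
    "noaccess": r"Undelivered include\s+(\d+)/\d+",
    "notallowed": r"Undelivered include\s+\d+/(\d+)",
}

def parse_check_dump(text):
    """Pull the counts out of one check_dump report."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError("could not find %r in check_dump output" % key)
        counts[key] = int(match.group(1))
    return counts

def advise(counts):
    """Suggest what to do with the job (assumed rule, see the text above)."""
    if counts["undelivered"] == 0 and counts["crashed"] == 0:
        return "done"
    retryable = counts["undelivered"] - counts["noaccess"] - counts["notallowed"]
    if retryable > 0:
        return "resubmit"
    return "hold"  # nothing retryable right now; try again in a few days

report = """File: farm-nb-shiftset-eve-16-sep-2001-t01.56.00_7299.dump
Files in Project  42
Undelivered files:  10
Reco crashed: 2
Undelivered include 0/0 NOACCESS/NOTALLOWED files and 0 which timed out
"""
print(advise(parse_check_dump(report)))  # prints resubmit
```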
 

production_summary


<d0mino> production_summary.csh