D0 FARM Shift Instructions for August 2001


GETTING READY
CHECKING ON PRESENT STATUS
STARTING NEW JOBS
SUBMITTING A JOB
EXAMPLES

GETTING READY
 

Shifters need to log in at least twice a day, at about 9AM and after 5PM.

log into d0bbin as d0farm

kinit as yourself again

type

source FARM_SETUP
this will put you in the run directory and set the $D0FARM_DIR environment variable
to point to the location of the scripts.

 

CHECKING ON PRESENT STATUS
 

Look at the FBSWWW display to see what's going on in a global way:

http://www-isd.fnal.gov/cgi-bin/fbsng/fbswww/fbswww?action=graphs&farm=D0

type

checknodes

this will give a dump of the status of the farm (this takes a long time).

You can see individual jobs by doing

fbs lj                 to see a list of jobs
listjobs.py <jobno>    to see a single job's details
fbs status             to see all jobs in gory detail
fbs status <jobno>     to see details for a particular job
 

STARTING NEW JOBS
 

1. First find out what version we are running from the previous shifter.

On August 10th it was t01.54.00

2. Check to see if new shift datasets have come in from the control room.

The shift captains will be creating datasets with keyword shift-rawset or store-dataset.
The store-dataset keyword follows the old instructions; shift-rawset should become the
standard soon.

To get a fast list:
 

sam list definitions --defname=shiftset%raw
sam list definitions --defname=store% | grep -v 1x8
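
The same two selections can be tried offline on a saved list of definition names. This is only a sketch, assuming that % in --defname matches any string; select_definitions is a hypothetical helper and not part of sam, and the two store names are made up for illustration.

```python
import fnmatch


def select_definitions(names, pattern, exclude=None):
    """Keep names matching a sam-style pattern.

    Assumption: '%' matches any run of characters.
    `exclude` imitates `grep -v`: names containing it are dropped.
    """
    glob = pattern.replace("%", "*")
    kept = [n for n in names if fnmatch.fnmatchcase(n, glob)]
    if exclude is not None:
        kept = [n for n in kept if exclude not in n]
    return kept


names = [
    "shiftset-owl-09-aug-2001-raw",
    "shiftset-day-09-aug-2001-raw",
    "store-example-1x8",  # made-up name
    "store-example",      # made-up name
]

print(select_definitions(names, "shiftset%raw"))
print(select_definitions(names, "store%", exclude="1x8"))
```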


To get a detailed list with status:

check_shift <version>


This will take some time but will create a list of the shiftsets and how many files have
been processed through that version.

If there are no shift datasets, you may need to call the control room and ask what runs
were good from the last shift.  Until the new procedures get in place, shift captains
are probably assuming that they can wait until the end of the store, which may be
days long.


SUBMITTING A JOB
 

First you have to transform the shift dataset into one for the farms.  This just
adds a check to see if the files have already been processed through the same
version:
 
make_store <shiftset-name> <version>
This will make a new dataset definition farm-<shiftset-name>-<version>
and put a listing in the ./projects subdirectory
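
The naming rule can be written down as a one-line function, which is handy when preparing the runrecocert arguments. This is a sketch of the naming pattern only; farm_dataset_name is a hypothetical helper, not one of the farm scripts.

```python
def farm_dataset_name(shiftset_name, version):
    """Return the definition name that make_store is described as creating."""
    return "farm-{}-{}".format(shiftset_name, version)


print(farm_dataset_name("shiftset-owl-09-aug-2001-raw", "t01.54.00"))
```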

To submit a job:

runrecocert <dataset> <number_of_nodes> <version> <queue>

runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ

The queue is normally TitaniumQ, which has 76 fast CPUs with 512 MB per processor.
The number of nodes should be less than about 40 and also less than the number of files.
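
The node-count guideline can be checked before typing the command. The sketch below is not one of the farm scripts: runrecocert_args is a hypothetical helper, and the limits are an interpretation. It treats 40 as a soft ceiling and allows the node count to equal the file count, since the session in EXAMPLES submits 10 nodes for a 10-file listing.

```python
MAX_NODES = 40  # "< ~40" in the text; treated here as a soft ceiling (assumption)


def runrecocert_args(dataset, nodes, version, n_files, queue="TitaniumQ"):
    """Return the runrecocert command as an argument list, or raise ValueError."""
    if nodes < 1:
        raise ValueError("need at least one node")
    if nodes > MAX_NODES:
        raise ValueError("%d nodes is above the ~%d guideline" % (nodes, MAX_NODES))
    if nodes > n_files:
        raise ValueError("%d nodes for only %d files" % (nodes, n_files))
    # Order follows: runrecocert <dataset> <number_of_nodes> <version> <queue>
    return ["runrecocert", dataset, str(nodes), version, queue]


args = runrecocert_args(
    "farm-shiftset-owl-09-aug-2001-raw-t01.54.00", 10, "t01.54.00", n_files=10
)
print(" ".join(args))
```

The function only builds the argument list; it does not submit anything.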

To check on the job:

listjobs.py <jobno>
 

The logs will go into

~d0farm/run/<date> (script logs)

or into

/d0/stripeX/samtest/<jobno>/logs and /d0/stripeX/samtest/<jobno>/badinput/logs (framework logs)
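
Since the stripe number varies from job to job, it is easiest to start from the Output Buffer path that listjobs.py reports (for example /d0/stripe7/samtest/5712 in the EXAMPLES section) and append the two log subdirectories. A minimal sketch; framework_log_dirs is a hypothetical helper.

```python
import posixpath


def framework_log_dirs(output_buffer):
    """Return the two framework log directories under a job's output buffer."""
    return [
        posixpath.join(output_buffer, "logs"),
        posixpath.join(output_buffer, "badinput", "logs"),
    ]


for path in framework_log_dirs("/d0/stripe7/samtest/5712"):
    print(path)
```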
 

EXAMPLES

 

 
 

<d0bbin> source FARM_SETUP
 
 

<d0bbin> fbs lj
JobID  State      User     Sections
------ ---------- -------- ----------------------
5704   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5705   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5707   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5712   running    d0farm   END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*

WORKER_JOB is the LINUX CPU
START_SAM is the master control job
_JOB_CONTROL is the monitor process

* means running,
d means dead,
w means waiting for other job section,
r means waiting for resources
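
The state codes can be decoded mechanically from one line of the fbs lj listing. This sketch assumes the column layout shown in the listing above (JobID, State, User, then NAME:code entries); parse_lj_line is a hypothetical helper, and entries cut off by the display (such as "_...") are skipped because their code is not visible.

```python
STATE_MEANING = {
    "*": "running",
    "d": "dead",
    "w": "waiting for other job section",
    "r": "waiting for resources",
}


def parse_lj_line(line):
    """Split one `fbs lj` row into (job_id, state, user, {section: meaning})."""
    fields = line.split()
    job_id, state, user = fields[0], fields[1], fields[2]
    sections = {}
    for token in fields[3:]:
        name, sep, code = token.rpartition(":")
        if not sep or code not in STATE_MEANING:
            continue  # truncated entry, no usable code
        sections[name] = STATE_MEANING[code]
    return job_id, state, user, sections


row = "5712   running    d0farm   END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*"
job_id, state, user, sections = parse_lj_line(row)
for name, meaning in sections.items():
    print(job_id, name, meaning)
```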

<d0bbin> checknodes
Summary of activity for user d0farm
Batch status:    37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data
File stores in progress:         0
ls: No match.
Files which need to be stored:   queued:  0      to be queued:  1

Here there are 37 jobs running on worker nodes: 33 are doing D0reco_x, 3 are waiting for data,
and one is doing something else.
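
The "something else" count is just the total minus the listed categories. A small sketch of that arithmetic, assuming the "Batch status:" line always has the comma-separated "count label" form shown above; parse_batch_status is a hypothetical helper.

```python
import re


def parse_batch_status(line):
    """Return (total_jobs, {label: count}) from a 'Batch status:' line."""
    text = line.split(":", 1)[1]
    total = None
    counts = {}
    for part in text.split(","):
        m = re.match(r"\s*(\d+)\s+(.+?)\s*$", part)
        if not m:
            continue
        n, label = int(m.group(1)), m.group(2)
        if label == "jobs":
            total = n
        else:
            counts[label] = n
    return total, counts


line = "Batch status:    37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data"
total, counts = parse_batch_status(line)
other = total - sum(counts.values())
print("jobs doing something else:", other)
```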
 
 

<d0bbin> listjobs.py 5712
----------------------- 5712 --------------------
Status of Batch job:     5712
Analysis Project:        farm.t01.54.00.5712
Project Definition:      farm-shiftset-eve-01aug04-raw-t01.54.00         Files to go:
Output Buffer:            /d0/stripe7/samtest/5712
Batch Queue:             TitaniumQ
Batch status:    20 jobs, 17 D0reco_x, 0 RecoAnalyze_x, 0 waiting for data, 3 finished
File stores in progress:         0
Sam Project Status:      For files: 1..32 errors: 11 in progress: 20 finished 0
Sam Consumer ID:         28256
Of 32 project files, umer saw 11 (20 good + 11 bad).
Umer failed to process 0 good files and missed 0 files
Number of files which crashed: 1
Number of files which are ok: 1
Machines:d0bbin, d0bbin, fnd080, fnd078, fnd079, fnd075, fnd076, fnd077, fnd071, fnd072, fnd073, fnd081, fnd083, fnd082, fnd085, fnd084, fnd087, fnd088, fnd056

5712 is the job number
Output buffer is where the data/logs will go
Batch Status tells you what the workers are doing
Sam Project status:
error = can't get the file
in progress = file has been cached, not finished yet
finished = released by d0 code after processing
In this job, two files are actually done: one crashed and the other was ok.
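
To keep a record of how a job progresses between shifts, the three counters can be pulled out of the "Sam Project Status" line. This is a sketch that assumes the line keeps exactly the wording shown in the listjobs.py output above; parse_sam_status is a hypothetical helper.

```python
import re

SAM_STATUS = re.compile(
    r"For files:\s*(\d+)\.\.(\d+)\s+errors:\s*(\d+)"
    r"\s+in progress:\s*(\d+)\s+finished:?\s*(\d+)"
)


def parse_sam_status(line):
    """Return the counters from a 'Sam Project Status' line, or None."""
    m = SAM_STATUS.search(line)
    if m is None:
        return None
    first, last, errors, in_progress, finished = map(int, m.groups())
    return {
        "first": first,
        "last": last,
        "errors": errors,            # can't get the file
        "in progress": in_progress,  # cached, not finished yet
        "finished": finished,        # released by d0 code after processing
    }


line = "Sam Project Status:      For files: 1..32 errors: 11 in progress: 20 finished 0"
print(parse_sam_status(line))
```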

<d0bbin> fbs status 5713
  Section ID: 5713._JOB_CONTROL   Name: _JOB_CONTROL

  Process resources: IO:1
  Section resources: StartSections:1
  Start Time: Fri Aug 10 09:04:22 2001    End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/jobControl.c*
  State:      running                     Depend:
  ProcType:   StartSAM                    Queue: StartQueue
  NumProc:    1                           Nice: 0

  Process #1  (5713._JOB_CONTROL.1)  on d0bbin Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  616092      0m01s  0m52s /bin/tcsh -f /home/d0farm/v00-03-01/farm_mac*
  616007      0m51s  0m51s  python -u /home/d0farm/v00-03-01/farm_machi*
  ------------------------------------------------------------------------

  Section ID: 5713.START_SAM   Name: START_SAM

  Process resources: IO:1
  Section resources: StartSections:1
  Start Time: Fri Aug 10 09:04:22 2001    End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startProject*
  State:      running                     Depend: started(_JOB_CONTROL)
  ProcType:   StartSAM                    Queue: StartQueue
  NumProc:    1                           Nice: 0

  Process #1  (5713.START_SAM.1)    on d0bbin Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  616075      0m01s  0m54s /bin/tcsh -f /home/d0farm/v00-03-01/farm_mac*
  612846      0m53s  0m53s  python -u /home/d0farm/v00-03-01/farm_machi*
  632733          0      0   sleep 600
  615346          0      0  /bin/tcsh -f /home/d0farm/v00-03-01/farm_ma*
  616469          0      0   sleep 28800
  ------------------------------------------------------------------------
  Section ID: 5713.WORKER_JOB   Name: WORKER_JOB

  Process resources: Titanium:1 cpu:100
  Section resources:
  Start Time: Fri Aug 10 09:04:23 2001    End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startWorker.*
  State:      running                     Depend: started(START_SAM)
  ProcType:   Worker_15                   Queue: TitaniumQ
  NumProc:    10                          Nice: 0

  Process #1  (5713.WORKER_JOB.1)   on fnd080 Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  15889           0 51m58s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
  16075       0m46s 51m58s  python -u /home/d0farm/v00-03-01/farm_machi*
  17984      51m12s 51m12s   ./D0reco_x -rcp runD0reco_data.rcp -out re*
  16702           0      0   (python)
  16035           0      0  tcsh -f /home/d0farm/v00-03-01/farm_machine*
  16037           0      0   sleep 28800

  Process #2  (5713.WORKER_JOB.2)   on fnd078 Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  9652            0 51m41s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
  9825        0m47s 51m41s  python -u /home/d0farm/v00-03-01/farm_machi*
  10465           0      0   (python)
  11720      50m54s 50m54s   ./D0reco_x -rcp runD0reco_data.rcp -out re*
  9798            0      0  tcsh -f /home/d0farm/v00-03-01/farm_machine*
  9800            0      0   sleep 28800

...........

..........

Process #10 (5713.WORKER_JOB.10)  on fnd073 Status: running
  PID           CPU   ACPU Command
  ---------- ------ ------ ----------------------------------------
  10919           0 50m29s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
  11065           0      0  tcsh -f /home/d0farm/v00-03-01/farm_machine*
  11067           0      0   sleep 28800
  11105       0m45s 50m29s  python -u /home/d0farm/v00-03-01/farm_machi*
  13017      49m44s 49m44s   ./D0reco_x -rcp runD0reco_data.rcp -out re*
  11759           0      0   (python)
  ------------------------------------------------------------------------

  Section ID: 5713.END   Name: END

  Process resources: IO:1
  Section resources:
  Start Time: Not Started                 End Time: Not Finished
  Hold Time:                              Prio: 0
  Exec:       /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/stopProject.*
  State:      waiting                     Depend: ended(WORKER_JOB)
  ProcType:   EndSAM                      Queue: EndQueue
  NumProc:    1                           Nice: 0
  ------------------------------------------------------------------------
 

<d0bbin> sam list definitions --defname=shiftset%raw

Dataset Def Name                                   Create Date         User Name  Work Group
shiftset-2001-08-03-080844-raw                     08/03/2001 09:09:06 schellma   dzero
shiftset-2001-08-03-180205-raw                     08/03/2001 19:02:28 schellma   dzero
shiftset-2001-08-04-152004-raw                     08/04/2001 16:20:24 schellma   dzero
shiftset-2001-08-04-152901-raw                     08/04/2001 16:29:10 schellma   dzero
shiftset-2001-08-04-162224-raw                     08/04/2001 17:22:35 schellma   dzero
shiftset-eve-01aug04-raw                           08/05/2001 08:44:53 schellma   demo
shiftset-day-01aug05-raw                           08/05/2001 19:53:42 schellma   dzero
shiftset-day-09-aug-2001-raw                       08/09/2001 17:33:02 schellma   dzero
shiftset-owl-09-aug-2001-raw                       08/09/2001 17:33:49 schellma   dzero
 

<d0bbin> make_store shiftset-owl-09-aug-2001-raw t01.54.00
Files:
   halo_0000127640_001.raw
   halo_0000127640_002.raw
   halo_0000127640_003.raw
   halo_0000127640_004.raw
   halo_0000127640_005.raw
   halo_0000127640_006.raw
   halo_0000127640_007.raw
   halo_0000127640_008.raw
   halo_0000127640_009.raw
   halo_0000127640_010.raw

File Count:  0
Average File Size:  255781

Pause for 15 seconds, can CTL-C if aren't interested

Files:
   halo_0000127640_001.raw
   halo_0000127640_002.raw
   halo_0000127640_003.raw
   halo_0000127640_004.raw
   halo_0000127640_005.raw
   halo_0000127640_006.raw
   halo_0000127640_007.raw
   halo_0000127640_008.raw
   halo_0000127640_009.raw
   halo_0000127640_010.raw

File Count:  0
Average File Size:  255781

Dataset definition created with Id:  6603

The new dataset is farm-<shiftset-name>-<version>, here farm-shiftset-owl-09-aug-2001-raw-t01.54.00

<d0bbin> runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ
Number of nodes choosen:  10
Recostruction version applied  t01.54.00
we are running against:  prd
disk  /d0/stripe7
recon&root/recon/root:  recon_root
number of events for reconstruction:  0
Date: Aug10
Cannot create directory "Aug10": File exists
Farm Job 5714 has been submitted...