GETTING READY
CHECKING ON PRESENT STATUS
STARTING NEW JOBS
SUBMITTING A JOB
EXAMPLES
Shifters need to log in at least twice a day, at about 9AM and after 5PM.
log into d0bbin as d0farm
kinit as yourself again
type
source FARM_SETUP
This will put you in the run directory and set the $D0FARM_DIR environment variable to point
to the location of the scripts.
Look at the FBSWWW display to see what's going on in a global way:
http://www-isd.fnal.gov/cgi-bin/fbsng/fbswww/fbswww?action=graphs&farm=D0
type
this will give a dump (once again taking a long time) of the status of the farm.
You can see individual jobs by doing
fbs lj to see a list of jobs
listjobs.py <jobno> to see a single job's details
fbs status to see all jobs in gory detail
fbs status <jobno> to see details for a particular job
1. First find out from the previous shifter what version we are running.
On August 10th it was t01.54.00.
2. Check to see if new shift datasets have come in from the control room.
The shift captains will be creating datasets with keyword shift-rawset or store-dataset.
The store-datasets follow the old instructions, and the shift-rawset should become the
standard soon.
To get a fast list:
sam list definitions --defname=shiftset%raw
sam list definitions --defname=store% | grep -v 1x8
To get a detailed list with status:
check_shift <version>
This will take some time but will create a list of the shiftsets and how many files have
been processed through that version.
If there are no shift datasets, you may need to call the control room and ask what runs
were good from the last shift. Until the new procedures are in place, shift captains
are probably assuming that they can wait until the end of the store, which may be
days long.
First you have to transform the shift dataset into one for the farms. This just
adds a check to see if the files have already been processed through the same
version:
make_store <shiftset-name> <version>
This will make a new dataset definition farm-<shiftset-name>-<version>
and put a listing in the ./projects subdirectory.
To submit a job:
runrecocert <dataset> <number_of_nodes> <version> <queue>
runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ
Queue is normally TitaniumQ, which has 76 fast CPUs with 512 MB/processor.
Number of nodes should be < ~40 and also < # of files.
To check on the job:
The logs will go into
~d0farm/run/<date> (script logs)
or into
/d0/stripeX/samtest/<jobno>/logs and /d0/stripeX/samtest/<jobno>/badinput/logs (framework logs)
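As a rough illustration of where the framework logs live, the sketch below derives both log directories from a job's output buffer. It assumes the output buffer reported by listjobs.py (for example /d0/stripe7/samtest/5712) is the /d0/stripeX/samtest/<jobno> directory named above. The function name framework_log_dirs is hypothetical and not part of the farm scripts.

```python
import posixpath


def framework_log_dirs(output_buffer):
    """Return the two framework log directories under a job's output buffer.

    Assumption: output_buffer has the form /d0/stripeX/samtest/<jobno>.
    """
    base = output_buffer.rstrip("/")
    return [
        posixpath.join(base, "logs"),
        posixpath.join(base, "badinput", "logs"),
    ]


if __name__ == "__main__":
    # Output buffer copied from a listjobs.py report.
    for path in framework_log_dirs("/d0/stripe7/samtest/5712"):
        print(path)
```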
<d0bbin> fbs lj
JobID  State      User     Sections
------ ---------- -------- ----------------------
5704   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5705   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5707   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5712   running    d0farm   END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*
WORKER_JOB is the LINUX CPU
START_SAM is the master control job
_JOB_CONTROL is the monitor process
* means running,
d means dead,
w means waiting for other job section,
r means waiting for resources
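The legend above can be applied mechanically to the Sections column of fbs lj. The following is only a sketch: the state codes come from the legend, while the function name decode_sections and the handling of truncated tokens such as "_..." are assumptions made for illustration.

```python
# State codes as given in the legend; everything else here is illustrative.
STATE_MEANING = {
    "*": "running",
    "d": "dead",
    "w": "waiting for other job section",
    "r": "waiting for resources",
}


def decode_sections(sections):
    """Map each SECTION:code token of an 'fbs lj' row to its meaning."""
    decoded = {}
    for token in sections.split():
        # Truncated tokens such as '_...' carry no NAME:code pair; skip them.
        if ":" not in token:
            continue
        name, _, code = token.rpartition(":")
        decoded[name] = STATE_MEANING.get(code, "unknown (" + code + ")")
    return decoded


if __name__ == "__main__":
    row = "END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _..."
    for name, meaning in decode_sections(row).items():
        print(name, "->", meaning)
```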
<d0bbin> checknodes
Summary of activity for user d0farm
Batch status: 37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data
File stores in progress: 0
ls: No match.
Files which need to be stored: queued: 0
to be queued: 1
Here there are 37 jobs running on worker nodes: 33 are doing D0reco_x,
3 are waiting for data,
and one is doing something else.
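The arithmetic behind "one is doing something else" can be written out as a small sketch. It assumes the Batch status text has the layout shown in the transcript and that the listed categories do not overlap. The function name summarize_batch_status is hypothetical. Pass only the Batch status text, not a whole report, so that unrelated numbers are not picked up.

```python
import re


def summarize_batch_status(text):
    """Extract the counts from a 'Batch status:' summary and derive the rest."""

    def count(pattern):
        match = re.search(pattern, text)
        return int(match.group(1)) if match else 0

    total = count(r"(\d+)\s+jobs")
    reco = count(r"(\d+)\s+D0reco_x")
    analyze = count(r"(\d+)\s+RecoAnalyze_x")
    waiting = count(r"(\d+)\s+waiting for data")
    finished = count(r"(\d+)\s+finished")
    # Whatever is left over is "doing something else".
    other = total - reco - analyze - waiting - finished
    return {
        "total": total,
        "D0reco_x": reco,
        "RecoAnalyze_x": analyze,
        "waiting_for_data": waiting,
        "finished": finished,
        "other": other,
    }


if __name__ == "__main__":
    line = "Batch status: 37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data"
    print(summarize_batch_status(line))
```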
<d0bbin> listjobs.py 5712
----------------------- 5712 --------------------
Status of Batch job: 5712
Analysis Project: farm.t01.54.00.5712
Project Definition: farm-shiftset-eve-01aug04-raw-t01.54.00
Files to go:
Output Buffer: /d0/stripe7/samtest/5712
Batch Queue: TitaniumQ
Batch status: 20 jobs, 17 D0reco_x, 0 RecoAnalyze_x, 0 waiting for data, 3 finished
File stores in progress: 0
Sam Project Status: For files: 1..32
errors: 11 in progress: 20 finished 0
Sam Consumer ID: 28256
Of 32 project files, consumer saw 11 (20 good + 11 bad).
Consumer failed to process 0 good files and missed 0 files
Number of files which crashed: 1
Number of files which are ok: 1
Machines: d0bbin, d0bbin, fnd080, fnd078, fnd079, fnd075, fnd076,
fnd077, fnd071, fnd072, fnd073, fnd081, fnd083, fnd082, fnd085, fnd084,
fnd087, fnd088, fnd056
5712 is the job number
Output buffer is where the data/logs will go
Batch Status tells you what the workers are doing
Sam Project status:
error = can't get the file
in progress = file has been cached, not finished yet
finished = released by d0 code after processing
In this job, two files are actually done: one crashed and the other
was ok.
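To see how many project files have not yet been handed out, the Sam Project Status counters can be subtracted from the size of the file range. This is a sketch only: it assumes the status text looks like the listjobs.py output above and that the errors, in progress and finished counters do not overlap. The function name sam_project_progress is hypothetical.

```python
import re


def sam_project_progress(status_text):
    """Parse 'For files: A..B errors: N in progress: N finished N'."""
    first, last = map(
        int, re.search(r"files:\s*(\d+)\.\.(\d+)", status_text).groups()
    )
    errors = int(re.search(r"errors:\s*(\d+)", status_text).group(1))
    in_progress = int(re.search(r"in progress:\s*(\d+)", status_text).group(1))
    finished = int(re.search(r"finished:?\s*(\d+)", status_text).group(1))
    total = last - first + 1
    return {
        "total": total,
        "errors": errors,          # can't get the file
        "in_progress": in_progress,  # cached, not finished yet
        "finished": finished,      # released by d0 code after processing
        # Assumption: the three counters are disjoint.
        "not_yet_seen": total - errors - in_progress - finished,
    }


if __name__ == "__main__":
    text = "Sam Project Status: For files: 1..32\nerrors: 11 in progress: 20 finished 0"
    print(sam_project_progress(text))
```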
<d0bbin> fbs status 5713
Section ID: 5713._JOB_CONTROL Name: _JOB_CONTROL
Process resources: IO:1
Section resources: StartSections:1
Start Time: Fri Aug 10 09:04:22 2001 End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/jobControl.c*
State: running
Depend:
ProcType: StartSAM
Queue: StartQueue
NumProc: 1
Nice: 0
Process #1 (5713._JOB_CONTROL.1) on d0bbin Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
616092     0m01s  0m52s  /bin/tcsh -f /home/d0farm/v00-03-01/farm_mac*
616007     0m51s  0m51s  python -u /home/d0farm/v00-03-01/farm_machi*
------------------------------------------------------------------------
Section ID: 5713.START_SAM Name: START_SAM
Process resources: IO:1
Section resources: StartSections:1
Start Time: Fri Aug 10 09:04:22 2001 End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startProject*
State: running
Depend: started(_JOB_CONTROL)
ProcType: StartSAM
Queue: StartQueue
NumProc: 1
Nice: 0
Process #1 (5713.START_SAM.1) on d0bbin Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
616075     0m01s  0m54s  /bin/tcsh -f /home/d0farm/v00-03-01/farm_mac*
612846     0m53s  0m53s  python -u /home/d0farm/v00-03-01/farm_machi*
632733     0      0      sleep 600
615346     0      0      /bin/tcsh -f /home/d0farm/v00-03-01/farm_ma*
616469     0      0      sleep 28800
------------------------------------------------------------------------
Section ID: 5713.WORKER_JOB Name: WORKER_JOB
Process resources: Titanium:1 cpu:100
Section resources:
Start Time: Fri Aug 10 09:04:23 2001 End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startWorker.*
State: running
Depend: started(START_SAM)
ProcType: Worker_15
Queue: TitaniumQ
NumProc: 10
Nice: 0
Process #1 (5713.WORKER_JOB.1) on fnd080 Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
15889      0      51m58s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
16075      0m46s  51m58s python -u /home/d0farm/v00-03-01/farm_machi*
17984      51m12s 51m12s ./D0reco_x -rcp runD0reco_data.rcp -out re*
16702      0      0      (python)
16035      0      0      tcsh -f /home/d0farm/v00-03-01/farm_machine*
16037      0      0      sleep 28800
Process #2 (5713.WORKER_JOB.2) on fnd078 Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
9652       0      51m41s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
9825       0m47s  51m41s python -u /home/d0farm/v00-03-01/farm_machi*
10465      0      0      (python)
11720      50m54s 50m54s ./D0reco_x -rcp runD0reco_data.rcp -out re*
9798       0      0      tcsh -f /home/d0farm/v00-03-01/farm_machine*
9800       0      0      sleep 28800
...........
..........
Process #10 (5713.WORKER_JOB.10) on fnd073 Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
10919      0      50m29s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
11065      0      0      tcsh -f /home/d0farm/v00-03-01/farm_machine*
11067      0      0      sleep 28800
11105      0m45s  50m29s python -u /home/d0farm/v00-03-01/farm_machi*
13017      49m44s 49m44s ./D0reco_x -rcp runD0reco_data.rcp -out re*
11759      0      0      (python)
------------------------------------------------------------------------
Section ID: 5713.END Name: END
Process resources: IO:1
Section resources:
Start Time: Not Started
End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/stopProject.*
State: waiting
Depend: ended(WORKER_JOB)
ProcType: EndSAM
Queue: EndQueue
NumProc: 1
Nice: 0
------------------------------------------------------------------------
<d0bbin> sam list definitions --defname=shiftset%raw
Dataset Def Name                Create Date          User Name  Work Group
shiftset-2001-08-03-080844-raw  08/03/2001 09:09:06  schellma   dzero
shiftset-2001-08-03-180205-raw  08/03/2001 19:02:28  schellma   dzero
shiftset-2001-08-04-152004-raw  08/04/2001 16:20:24  schellma   dzero
shiftset-2001-08-04-152901-raw  08/04/2001 16:29:10  schellma   dzero
shiftset-2001-08-04-162224-raw  08/04/2001 17:22:35  schellma   dzero
shiftset-eve-01aug04-raw        08/05/2001 08:44:53  schellma   demo
shiftset-day-01aug05-raw        08/05/2001 19:53:42  schellma   dzero
shiftset-day-09-aug-2001-raw    08/09/2001 17:33:02  schellma   dzero
shiftset-owl-09-aug-2001-raw    08/09/2001 17:33:49  schellma   dzero
<d0bbin> make_store shiftset-owl-09-aug-2001-raw t01.54.00
Files:
halo_0000127640_001.raw
halo_0000127640_002.raw
halo_0000127640_003.raw
halo_0000127640_004.raw
halo_0000127640_005.raw
halo_0000127640_006.raw
halo_0000127640_007.raw
halo_0000127640_008.raw
halo_0000127640_009.raw
halo_0000127640_010.raw
File Count: 0
Average File Size: 255781
Pause for 15 seconds, can CTL-C if aren't interested
Files:
halo_0000127640_001.raw
halo_0000127640_002.raw
halo_0000127640_003.raw
halo_0000127640_004.raw
halo_0000127640_005.raw
halo_0000127640_006.raw
halo_0000127640_007.raw
halo_0000127640_008.raw
halo_0000127640_009.raw
halo_0000127640_010.raw
File Count: 0
Average File Size: 255781
Dataset definition created with Id: 6603
Data set is farm-<shiftset-name>-<version>
<d0bbin> runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ
Number of nodes choosen: 10
Recostruction version applied t01.54.00
we are running against: prd
disk /d0/stripe7
recon&root/recon/root: recon_root
number of events for reconstruction: 0
Date: Aug10
Cannot create directory "Aug10": File exists
Farm Job 5714 has been submitted...
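The make_store and runrecocert steps shown above can be tied together in a small sketch that builds the farm dataset name and the runrecocert argument list. The helper names farm_dataset_name and runrecocert_command are hypothetical, and nothing here actually submits a job. The instructions give the node limits as "< ~40" and "< # of files", yet the worked example uses 10 nodes on a 10-file dataset, so this sketch treats both as soft guidance and only rejects node counts of 40 or more, or counts above the number of files.

```python
MAX_NODES = 40  # the instructions say "< ~40", so this cap is approximate


def farm_dataset_name(shiftset, version):
    """Name of the definition that make_store creates for a shift dataset."""
    return "farm-{}-{}".format(shiftset, version)


def runrecocert_command(shiftset, version, nodes, n_files, queue="TitaniumQ"):
    """Build the argument list: runrecocert <dataset> <nodes> <version> <queue>."""
    if nodes < 1:
        raise ValueError("need at least one node")
    if nodes >= MAX_NODES:
        raise ValueError("number of nodes should stay below about 40")
    if nodes > n_files:
        # Assumption: equality is tolerated, as in the worked example.
        raise ValueError("more nodes than files in the dataset")
    return [
        "runrecocert",
        farm_dataset_name(shiftset, version),
        str(nodes),
        version,
        queue,
    ]


if __name__ == "__main__":
    cmd = runrecocert_command("shiftset-owl-09-aug-2001-raw", "t01.54.00", 10, 10)
    print(" ".join(cmd))
```

Printing the command rather than running it keeps the check separate from the actual submission, which still happens by hand on d0bbin.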