GETTING READY
WHAT TO DO
CHECKING ON PRESENT STATUS
STARTING NEW JOBS
THE REQUEST SYSTEM
MAKING REQUESTS
APPROVING REQUESTS
ACTIVATING REQUESTS
CHECKING ON JOBS
CHECKING AND RESUBMITTING
WEEKLY REPORTS
Shifters need to log in at least twice a day, at about 9 AM and after 5 PM.
log into d0bbin as d0farm
kinit as yourself again
type
source FARM_SETUP
This will put you in the run directory and set the $D0FARM_DIR environment variable to point
to the location of the scripts.
Your first priority is to submit new jobs; after that, try to get the failed ones working.
Jobs can fail to process all files for several reasons:
File was not in sam yet from online when job submitted (unavoidable)
File was not delivered to farm because of tape failure (common)
File was not delivered to farm because of staging or db failure
(rare but a real bad thing when it happens)
File was delivered but reco or reco_analyze crashed (rare)
File was delivered but node crashed due to memory overload or disk
failure (rare)
File was processed but not copied back to the I/O node (rare)
File has not been stored to tape yet due to tape failure (common)
Look at the FBSWWW display to see what's going on in a global way:
http://www-isd.fnal.gov/cgi-bin/fbsng/fbswww/fbswww?action=graphs&farm=D0
http://d0db.fnal.gov/sam_farm_request/ for the status of present requests
You can see individual jobs by doing
listprojects to see a list of jobs
listjobs.py <jobno> to see a single job's details
fbs status to see all jobs in gory detail
fbs status <jobno> to see details for a particular job
this will give a dump (once again taking a long time) of the status of the farm.
1. First find out what version we are running from the previous shifter. On Nov 29 it was p10.11.00
2. Check to see if you need to create new daily shift sets (now called 'daysets')
sam list definitions --defname=dayset% will list the existing ones.
Is yesterday's data included? If not, type
make_day <MM/DD/YYYY>
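If you want to avoid typing the date by hand, the argument can be built from today's date. This is only a sketch: the helper name make_day_argument is made up here, and the only thing taken from the instructions above is the MM/DD/YYYY format.

```python
from datetime import date, timedelta


def make_day_argument(today):
    """Return yesterday's date as MM/DD/YYYY, the format shown for make_day."""
    yesterday = today - timedelta(days=1)
    return yesterday.strftime("%m/%d/%Y")


if __name__ == "__main__":
    # Print the command line you would type for yesterday's data.
    print("make_day", make_day_argument(date.today()))
```

For example, calling make_day_argument(date(2001, 12, 6)) gives the string 12/05/2001.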
We have a new system, the request system at
http://d0db.fnal.gov/sam_farm_request/
This allows you to submit requests, which consist of a project
name and a code version.
Each request will be sent to the farm and can be made to run multiple
times until all files are processed or you decide it's not worth pursuing anymore.
The request system has 3 phases:
1) Request: any user can do this.
2) Approval: this can only be done by someone with administrator privileges,
like you. This allows you to approve, hold or finish requests.
3) Activation: this is done by running a script on the farms.
In future it will be automatic, but for now it's done by hand.
Go to http://d0db.fnal.gov/sam_farm_request/
click on FARM REQUEST button
fill in the following fields
REQUIRED:
- Name: <your id>
- Project Name: <the project definition>
- Appl Name Version: <the reco version>
OPTIONAL AND USEFUL but you can use the defaults:
- Number of events - 0 means all. Don't change this unless you are using Application Name Version "recotest"; otherwise truncated files will be stored back into sam.
- Comment - feel free to comment
- Number of nodes - 0 gives a useful default, but you can set it yourself. For large datasets, set it to 30 or 40 max.
- Queue - production_lo is guaranteed to run someday, production_hi is faster, production_fast runs only on fast nodes, production_slow runs only on slow nodes, test runs only on test nodes.
The rest of the terms are not useful yet and should be ignored.
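The rules for the form fields can be written down as a small check to run over a request before you submit it. This is a sketch only: the FarmRequest class, the check_request function and the default queue are invented for illustration and are not part of the request system. The field names and the rules come from the list above.

```python
from dataclasses import dataclass

# Queue names as listed in the instructions.
QUEUES = {"production_lo", "production_hi", "production_fast",
          "production_slow", "test"}


@dataclass
class FarmRequest:
    name: str                      # your id (required)
    project_name: str              # the project definition (required)
    appl_name_version: str         # the reco version (required)
    number_of_events: int = 0      # 0 means all events
    number_of_nodes: int = 0       # 0 lets the system pick a default
    queue: str = "production_lo"   # assumed default, not confirmed by the form
    comment: str = ""


def check_request(req):
    """Return a list of problems found in the request (empty if none)."""
    problems = []
    if not req.name:
        problems.append("Name is required")
    if not req.project_name:
        problems.append("Project Name is required")
    if not req.appl_name_version:
        problems.append("Appl Name Version is required")
    # A nonzero event count stores truncated files unless running recotest.
    if req.number_of_events != 0 and req.appl_name_version != "recotest":
        problems.append("Number of events must stay 0 unless version is recotest")
    if req.number_of_nodes > 40:
        problems.append("Number of nodes should be 40 at most")
    if req.queue not in QUEUES:
        problems.append("Unknown queue: " + req.queue)
    return problems
```

A request with only the three required fields filled in returns an empty list; one that asks for 100 events with a production version returns a single problem.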
fill in the admin account and password and hit the ADMIN button
Then hit list to see a bunch of requests
ACTIVATING THE JOBS
The requests page just keeps track of requests; a separate system tells the batch system to talk to the requests and activate them.
Log in to a CLEAN session on d0bbin as d0farm
type
source FARM_SETUP
start_approved_jobs
which for now produces a huge amount of stuff - we're working on it.
This will start any jobs in the approved or partial state. This is all you have to do.
To check on the job:
The logs will go into
~d0farm/run/<date> (script logs)
or into
/d0/stripeX/samtest/<jobno>/logs and /d0/stripeX/samtest/<jobno>/badinput/logs (framework logs)
To see which jobs are running
This will summarize job numbers,
the project names and the sections which are working.
Disks should be less than 90% full. /stripe4 and /stripe9 are an exception:
they are used by other groups.
/stripe8 is used for test output.
listfast tells you if the file stores for running jobs are working
ps -ef | grep -c storeafile3.py
tells you how many file stores are queued up.
This will be automated soon but for now it is a royal, extreme pain.
Every couple of days, you want to resubmit a bunch of old jobs.
You can find the status of existing jobs by looking at the check_shift dump,
going to the ~/run/jobdumps directory,
and running the script check_dump on the dump files which are created at end of job.
There is a script in
~schellma/v03-01-00/farm_machinery/samutils on d0mino called
production_summary.csh
production_summary.csh 09/10/2001 09/16/2001 t01.56.00
will give t01.56 summary for dates between the 10th and 16th inclusive.
You need to post this in
~WWW/docs/computing/production/weekly_reports
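The two dates for the weekly summary can be worked out rather than counted on a calendar. The sketch below assumes the report covers the most recent complete Monday-to-Sunday week, which is only inferred from the example dates above (09/10/2001 was a Monday). The function names are made up for illustration.

```python
from datetime import date, timedelta


def last_full_week(today):
    """Return (monday, sunday) of the most recent complete week before today."""
    this_monday = today - timedelta(days=today.weekday())  # weekday(): Monday is 0
    start = this_monday - timedelta(days=7)
    end = start + timedelta(days=6)
    return start, end


def summary_command(today, version):
    """Build the argument list for production_summary.csh for last week."""
    start, end = last_full_week(today)
    fmt = "%m/%d/%Y"
    return ["production_summary.csh", start.strftime(fmt), end.strftime(fmt), version]


if __name__ == "__main__":
    # The version string here is just the one used in the example above.
    print(" ".join(summary_command(date.today(), "t01.56.00")))
```

Run on any day of the week starting 09/17/2001, summary_command returns the same arguments as the example: 09/10/2001 09/16/2001 t01.56.00.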
<d0bbin> fbs lj
JobID  State      User     Sections
------ ---------- -------- ----------------------
5704   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5705   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5707   running    d0farm   END:d START_SAM:* WORKER_JOB:d _SAM_JOB_END:w _...
5712   running    d0farm   END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*
WORKER_JOB is the LINUX CPU
START_SAM is the master control job
_JOB_CONTROL is the monitor process
* means running,
d means dead,
w means waiting for other job section,
r means waiting for resources
<d0bbin> checknodes
Summary of activity for user d0farm
Batch status: 37 jobs, 33 D0reco_x, 0 RecoAnalyze_x, 3 waiting for data
File stores in progress: 0
ls: No match.
Files which need to be stored: queued: 0
to be queued: 1
Here there are 37 jobs running on worker nodes, 33 are doing D0reco_x,
3 are waiting for data
and one is doing something else.
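The "something else" count is just what is left over after subtracting the named categories from the total. A small sketch of that arithmetic, reading the Batch status line as printed above; the function name is made up, and the optional "finished" count is taken from the listjobs.py output shown further down.

```python
import re

# Matches the Batch status line even when it wraps onto a second line.
BATCH_STATUS = re.compile(
    r"(\d+) jobs,\s*(\d+) D0reco_x,\s*(\d+) RecoAnalyze_x,"
    r"\s*(\d+) waiting for data(?:,\s*(\d+) finished)?"
)


def parse_batch_status(text):
    """Return the counts from a Batch status line, plus the leftover 'other'."""
    m = BATCH_STATUS.search(text)
    if m is None:
        raise ValueError("no Batch status line found")
    jobs, reco, analyze, waiting = (int(g) for g in m.groups()[:4])
    finished = int(m.group(5)) if m.group(5) else 0
    return {
        "jobs": jobs,
        "D0reco_x": reco,
        "RecoAnalyze_x": analyze,
        "waiting": waiting,
        "finished": finished,
        # Whatever is not accounted for by the named categories.
        "other": jobs - reco - analyze - waiting - finished,
    }
```

For the line above, 37 - 33 - 0 - 3 leaves 1 job doing something else.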
d0bbin > listjobs.py 5712
----------------------- 5712 --------------------
Status of Batch job: 5712
Analysis Project: farm.t01.54.00.5712
Project Definition: farm-shiftset-eve-01aug04-raw-t01.54.00
Files to go:
Output Buffer: /d0/stripe7/samtest/5712
Batch Queue: TitaniumQ
Batch status: 20 jobs, 17 D0reco_x, 0 RecoAnalyze_x, 0 waiting for data, 3 finished
File stores in progress: 0
Sam Project Status: For files: 1..32
errors: 11 in progress: 20 finished 0
Sam Consumer ID: 28256
Of 32 project files, umer saw 11 (20 good + 11 bad).
Umer failed to process 0 good files and missed 0 files
Number of files which crashed: 1
Number of files which are ok: 1
Machines:d0bbin, d0bbin, fnd080, fnd078, fnd079, fnd075, fnd076,
fnd077, fnd071, fnd072, fnd073, fnd081, fnd083, fnd082, fnd085, fnd084,
fnd087, fnd088, fnd056
5712 is the job number
Output buffer is where the data/logs will go
Batch Status tells you what the workers are doing
Sam Project status:
error = can't get the file
in progress = file has been cached, not finished yet
finished = released by d0 code after processing
In this job, two files are actually done: one crashed and the other
was ok.
<d0bbin> fbs status 5713
Section ID: 5713._JOB_CONTROL Name: _JOB_CONTROL
Process resources: IO:1
Section resources: StartSections:1
Start Time: Fri Aug 10 09:04:22 2001
End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/jobControl.c*
State: running
Depend:
ProcType: StartSAM
Queue: StartQueue
NumProc: 1
Nice: 0
Process #1 (5713._JOB_CONTROL.1) on d0bbin Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
616092     0m01s  0m52s  /bin/tcsh -f /home/d0farm/v00-03-01/farm_mac*
616007     0m51s  0m51s  python -u /home/d0farm/v00-03-01/farm_machi*
------------------------------------------------------------------------
Section ID: 5713.START_SAM Name: START_SAM
Process resources: IO:1
Section resources: StartSections:1
Start Time: Fri Aug 10 09:04:22 2001
End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startProject*
State: running
Depend: started(_JOB_CONTROL)
ProcType: StartSAM
Queue: StartQueue
NumProc: 1
Nice: 0
Process #1 (5713.START_SAM.1) on d0bbin Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
616075     0m01s  0m54s  /bin/tcsh -
612846     0m53s  0m53s  python -u /home/d0farm/v00-03-01/farm_machi*
632733     0      0      sleep 600
615346     0      0      /bin/tcsh -f /home/d0farm/v00-03-01/farm_ma*
616469     0      0      sleep 28800
------------------------------------------------------------------------
Section ID: 5713.WORKER_JOB Name: WORKER_JOB
Process resources: Titanium:1 cpu:100
Section resources:
Start Time: Fri Aug 10 09:04:23 2001
End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/startWorker.*
State: running
Depend: started(START_SAM)
ProcType: Worker_15
Queue: TitaniumQ
NumProc: 10
Nice: 0
Process #1 (5713.WORKER_JOB.1) on fnd080 Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
15889      0      51m58s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
16075      0m46s  51m58s python -u /home/d0farm/v00-03-01/farm_machi*
17984      51m12s 51m12s ./D0reco_x -rcp runD0reco_data.rcp -out re*
16702      0      0      (python)
16035      0      0      tcsh -f /home/d0farm/v00-03-01/farm_machine*
16037      0      0      sleep 28800
Process #2 (5713.WORKER_JOB.2) on fnd078 Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
9652       0      51m41s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
9825       0m47s  51m41s python -u /home/d0farm/v00-03-01/farm_machi*
10465      0      0      (python)
11720      50m54s 50m54s ./D0reco_x -rcp runD0reco_data.rcp -out re*
9798       0      0      tcsh -f /home/d0farm/v00-03-01/farm_machine*
9800       0      0      sleep 28800
...........
..........
Process #10 (5713.WORKER_JOB.10) on fnd073 Status: running
PID        CPU    ACPU   Command
---------- ------ ------ ----------------------------------------
10919      0      50m29s tcsh -f /home/d0farm/v00-03-01/farm_machiner*
11065      0      0      tcsh -f /home/d0farm/v00-03-01/farm_machine*
11067      0      0      sleep 28800
11105      0m45s  50m29s python -u /home/d0farm/v00-03-01/farm_machi*
13017      49m44s 49m44s ./D0reco_x -rcp runD0reco_data.rcp -out re*
11759      0      0      (python)
------------------------------------------------------------------------
Section ID: 5713.END Name: END
Process resources: IO:1
Section resources:
Start Time: Not Started
End Time: Not Finished
Hold Time:
Prio: 0
Exec: /home/d0farm/v00-03-01/farm_machinery/v3_fbsng/stopProject.*
State: waiting
Depend: ended(WORKER_JOB)
ProcType: EndSAM
Queue: EndQueue
NumProc: 1
Nice: 0
------------------------------------------------------------------------
<d0bbin> sam list definitions --defname=shiftset%sep%
Dataset Def Name Create Date User Name Work Group
shiftset-eve-01-sep-2001 09/01/2001 17:12:54 d0run dzero
shiftset-owl-02-sep-2001 09/02/2001 06:51:39 d0run dzero
shiftset-day-02-sep-2001 09/02/2001 16:59:15 d0run dzero
shiftset-owl-04-sep-2001 09/04/2001 08:54:56 d0run dzero
shiftset-day-04-sep-2001 09/04/2001 17:19:32 d0run dzero
shiftset-eve-04-sep-2001 09/05/2001 01:29:10 d0run dzero
shiftset-day-05-sep-2001 09/05/2001 16:46:20 d0run dzero
shiftset-eve-05-sep-2001 09/06/2001 00:58:39 d0run dzero
shiftset-day-06-sep-2001 09/06/2001 17:00:28 d0run dzero
shiftset-eve-06-sep-2001 09/07/2001 01:06:28 d0run dzero
shiftset-day-08-sep-2001 09/08/2001 16:45:14 d0run dzero
shiftset-owl-09-sep-2001 09/09/2001 08:53:29 d0run dzero
shiftset-day-09-sep-2001 09/09/2001 17:09:20 d0run dzero
shiftset-eve-09-sep-2001 09/10/2001 01:09:47 d0run dzero
shiftset-owl-10-sep-2001 09/10/2001 08:30:30 d0run dzero
shiftset-eve-08-sep-2001 09/10/2001 18:56:06 d0run dzero
shiftset-day-14-sep-2001 09/14/2001 17:33:36 d0run dzero
shiftset-eve-14-sep-2001 09/15/2001 01:11:21 d0run dzero
shiftset-owl-15-sep-2001 09/15/2001 08:34:35 d0run dzero
shiftset-owl-15-sep-2001b 09/15/2001 08:35:21 d0run dzero
shiftset-eve-15-sep-2001 09/16/2001 01:30:44 d0run dzero
shiftset-owl-16-sep-2001 09/16/2001 08:54:47 d0run dzero
shiftset-day-16-sep-2001 09/16/2001 16:49:43 d0run dzero
<d0bbin> make_day 12/05/2001
Files:
halo_0000127640_001.raw
halo_0000127640_002.raw
halo_0000127640_003.raw
halo_0000127640_004.raw
halo_0000127640_005.raw
halo_0000127640_006.raw
halo_0000127640_007.raw
halo_0000127640_008.raw
halo_0000127640_009.raw
halo_0000127640_010.raw
File Count: 0
Average File Size: 255781
Pause for 15 seconds, can CTL-C if aren't interested
Files:
halo_0000127640_001.raw
halo_0000127640_002.raw
halo_0000127640_003.raw
halo_0000127640_004.raw
halo_0000127640_005.raw
halo_0000127640_006.raw
halo_0000127640_007.raw
halo_0000127640_008.raw
halo_0000127640_009.raw
halo_0000127640_010.raw
File Count: 0
Average File Size: 255781
Dataset definition created with Id: 6603
Data set is farm-nb-<shiftset-name>-<version>
<d0bbin> runrecocert farm-shiftset-owl-09-aug-2001-raw-t01.54.00 10 t01.54.00 TitaniumQ
Number of nodes choosen: 10
Recostruction version applied t01.54.00
we are running against: prd
disk /d0/stripe7
recon&root/recon/root: recon_root
number of events for reconstruction: 0
Date: Aug10
Cannot create directory "Aug10": File exists
Farm Job 5714 has been submitted...
<d0bbin> listprojects
7080 farm-nb-shiftset-day-06-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*
7084 farm-nb-shiftset-owl-04-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB:* _JOB_CONTROL:*
7097 farm-nb-shiftset-eve-15-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7381 farm-nb-shiftset-owl-10-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:d _...
7474 farm-nb-shiftset-day-05-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7475 farm-nb-shiftset-eve-09-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7476 farm-nb-shiftset-day-08-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7478 farm-recocert-p09.08.00-sim-p10.04.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:* _...
7479 farm-nb-recocert-129194-raw-p10.04.00 running END:d START_SAM:* WORKER_JOB1:d WORKER_JOB2:d _...
7487 farm-nb-shiftset-owl-16-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:d WORKER_JOB2:* _...
7488 farm-nb-shiftset-owl-15-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
7520 farm-nb-shiftset-owl-23-sep-2001-t01.56.00 running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:d _...
7560 recocert-129194-raw running END:w START_SAM:* WORKER_JOB1:* WORKER_JOB2:f _...
This shows the status of the various sections of the job and which project it is running.
'*' means running
'r' means 'ready'. This is bad as it means hung by some other queue
'd' means 'dead'. Can happen for WORKER_JOB2 sections if you run a null section
'w' means waiting - normal for END sections, they wait for the WORKER sections
'f' means failed. Can also happen for WORKER_JOB2 sections
'c' means canceled
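If you have a long listprojects listing, the section codes can be decoded by a few lines of script instead of by eye. This is a sketch, assuming each line has the layout shown above (job number, project, job state, then NAME:code tokens). The function names are made up, and the script only reads text you paste into it; it does not talk to the batch system.

```python
# Meaning of each one-letter section code, from the legend above.
STATE_NAMES = {
    "*": "running",
    "r": "ready",
    "d": "dead",
    "w": "waiting",
    "f": "failed",
    "c": "canceled",
}


def parse_listprojects_line(line):
    """Split one listprojects line into (job_id, project, job_state, sections)."""
    fields = line.split()
    job_id, project, job_state = fields[0], fields[1], fields[2]
    sections = {}
    for token in fields[3:]:
        if ":" not in token:
            continue  # skip the "_..." truncation marker
        name, code = token.rsplit(":", 1)
        sections[name] = STATE_NAMES.get(code, code)
    return job_id, project, job_state, sections


def hung_sections(sections):
    """Return the names of sections in the 'ready' state, which means hung."""
    return sorted(name for name, state in sections.items() if state == "ready")
```

For job 7381 above this gives END waiting, START_SAM and WORKER_JOB1 running, and WORKER_JOB2 dead, with nothing reported as hung.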
/dev/root          xfs   13673424    7113920    6559504  53  /
/dev/xlv/stripe9   xfs  142161776  133177520    8984256  94  /d0/stripe9
/dev/xlv/stripe5   xfs  142161776   65462904   76698872  47  /d0/stripe5
/dev/xlv/stripe8   xfs  142161776   47465048   94696728  34  /d0/stripe8
/dev/xlv/stripe7   xfs  142161776   28237224  113924552  20  /d0/stripe7
/dev/xlv/stripe6   xfs  142161776   12048416  130113360   9  /d0/stripe6
/dev/xlv/d0farm    xfs   35534776   30013544    5521232  85  /export/d0farm
/dev/xlv/stripe4   xfs  142162560  117335864   24826696  83  /d0/stripe4
/dev/xlv/stripe3   xfs  142161280   97439040   44722240  69  /d0/stripe3
/dev/xlv/crash     xfs   35534776    4518384   31016392  13  /var/adm/crash
/dev/xlv/stripe2   xfs  142162560   53248816   88913744  38  /d0/stripe2
/dev/xlv/stripe1   xfs  142161280   46144600   96016680  33  /d0/stripe1
/dev/dsk/dks2d8s1  xfs    6284240    3480872    2803368  56  /export/products
/dev/dsk/dks2d8s4  xfs    6259192       8600    6250592   1  /export/usr/local
/dev/dsk/dks2d8s0  xfs   11480952    4677144    6803808  41  /export/home
Disks stripe1, 2, 3, 5, 6 and 7 should not be near 90%. The worker
jobs should stall if a disk hits 90%.
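The 90% rule is easy to check by script on a saved copy of the disk listing. The sketch below assumes each row has seven whitespace-separated fields with the percent-used value in the sixth and the mount point in the seventh; that layout is read off the rows above, not from any documentation. The function name is made up. stripe4 and stripe9 are skipped because other groups use them.

```python
# Stripes used by other groups, which are allowed to run fuller.
SHARED = {"/d0/stripe4", "/d0/stripe9"}
LIMIT = 90  # percent full at which worker jobs are expected to stall


def full_disks(df_lines):
    """Return (mount, percent) for farm stripes at or above LIMIT."""
    flagged = []
    for line in df_lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # not a complete row (header, blank or wrapped line)
        percent = int(fields[5].rstrip("%"))
        mount = fields[6]
        if mount.startswith("/d0/stripe") and mount not in SHARED and percent >= LIMIT:
            flagged.append((mount, percent))
    return flagged


if __name__ == "__main__":
    sample = [
        "/dev/xlv/stripe9 xfs 142161776 133177520 8984256 94 /d0/stripe9",
        "/dev/xlv/stripe5 xfs 142161776 65462904 76698872 47 /d0/stripe5",
        "/dev/xlv/stripe4 xfs 142162560 117335864 24826696 83 /d0/stripe4",
    ]
    print(full_disks(sample))
```

With the rows shown above nothing is flagged: stripe9 is at 94% but is exempt, and the other stripes are well under the limit.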
check_dump farm-nb-shiftset-eve-16-sep-2001-t01.56.00_7299.dump
File: farm-nb-shiftset-eve-16-sep-2001-t01.56.00_7299.dump
Files in Project 42
Good files (32
Undelivered files: 10
Reco crashed: 2
Have 12 you probably can't process
Undelivered include 0/0 NOACCESS/NOTALLOWED files and 0 which timed
out
As the undelivered files were not NOACCESS, I'd resubmit this one
check_dump farm-nb-shiftset-owl-18-sep-2001-t01.56.00_7119.dump
File: farm-nb-shiftset-owl-18-sep-2001-t01.56.00_7119.dump
Files in Project 47
Good files (47
Undelivered files: 0
Reco crashed: 0
Have 0 you probably can't process
Undelivered include 0/0 NOACCESS/NOTALLOWED files and 0 which timed
out
This one is DONE!!
check_dump farm-nb-shiftset-owl-22-sep-2001-t01.56.00_7490.dump
File: farm-nb-shiftset-owl-22-sep-2001-t01.56.00_7490.dump
Files in Project 123
Good files (0
Undelivered files: 123
Reco crashed: 0
Have 123 you probably can't process
Undelivered include 123/0 NOACCESS/NOTALLOWED files and 0 which
timed out
Your worst nightmare - all of the files, every single last one, are on a bad tape. Bummer.
You may be able to rerun in a couple of days.
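The three cases above follow a simple rule, which can be sketched as a script run over saved check_dump output. The function names and the action labels (done, resubmit, wait, inspect) are made up for illustration. The rule itself is read off the three examples, so treat it as a starting point and not as policy.

```python
import re


def summarize_dump(text):
    """Pull the counts out of check_dump output (line wraps are tolerated)."""
    def grab(pattern):
        m = re.search(pattern, text)
        if m is None:
            raise ValueError("pattern not found: " + pattern)
        return [int(g) for g in m.groups()]

    (total,) = grab(r"Files in Project\s+(\d+)")
    (good,) = grab(r"Good files\s+\((\d+)")
    (undelivered,) = grab(r"Undelivered files:\s+(\d+)")
    (crashed,) = grab(r"Reco crashed:\s+(\d+)")
    noaccess, notallowed = grab(r"Undelivered include\s+(\d+)/(\d+)\s+NOACCESS/NOTALLOWED")
    return {"total": total, "good": good, "undelivered": undelivered,
            "crashed": crashed, "noaccess": noaccess, "notallowed": notallowed}


def suggest_action(summary):
    """Suggest what to do with a job, following the three examples above."""
    if summary["good"] == summary["total"]:
        return "done"
    blocked = summary["noaccess"] + summary["notallowed"]
    if summary["undelivered"] > blocked:
        return "resubmit"   # some undelivered files were not NOACCESS
    if summary["undelivered"] > 0:
        return "wait"       # files are on a bad tape, retry in a few days
    return "inspect"        # only crashes left, look at the logs
```

Applied to the three dumps above, this suggests resubmit for job 7299, done for 7119 and wait for 7490.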
<d0mino> production_summary.csh