D0 Production page SAM Batch System
You need to log in to d0mino or another Fermilab machine, then ssh to d0bbin and the farm nodes. The worker nodes are not directly accessible from outside Fermilab.
The official submission directory can be found in
~d0farm/<version>/farm_machinery/v1_fbsng
<version> is currently v00-01-00.
One can get a new version by:
cd
mkdir <version>
cd <version>
cvs checkout -r <version> farm_machinery
setup fbsng
The farms are divided into 2 groups and then into smaller subsections. Each node has one or more color attributes and queues.
The queue TitaniumQ runs on 49 of the 800 MHz duals, for 98 total processors. The queue BlueQ runs on 26 of the slower 500 MHz duals, for 52 processors. The farms are divided up into 10 farmlets and 2 test machines.
The farmlets Worker_1 through Worker_5 are old 500 MHz duals.
Worker_test is a single node, fnd01, for debugging. Its queue is PINKQ.
Worker_1 has 9 nodes and is for farm code testing. Its queue is GREENQ.
Worker_2 is currently assigned to David Adams for cft work.
Worker_3 through Worker_5 are ORANGEQ, PURPLEQ and YELLOWQ, and can be combined into queue BLUEQ.
Worker_5 has only 7 nodes, so BLUEQ can have 27 nodes.
All 47 slow nodes are in queue WHITEQ.
The farmlets Worker_6 through Worker_10 are new 800 MHz duals.
Worker_test_1 is fnd051, for testing. Its queue is DoveQ.
Worker_6 through Worker_10 are VermilionQ, ChartreuseQ, SienaQ, LavenderQ and CadmiumQ.
Worker_7 through Worker_10 are CobaltQ.
Worker_6 through Worker_10 are TitaniumQ, which has 49 fast nodes.
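The processor totals quoted for TitaniumQ and BlueQ follow from every node being a dual-processor machine. A minimal sketch of that arithmetic, using the node counts given with those totals; `processors_for` is a hypothetical helper name, not part of the farm tools:

```python
# Every farm node described here is a "dual", i.e. it has two processors.
PROCESSORS_PER_NODE = 2


def processors_for(node_count):
    """Return the processor total for a queue spanning node_count dual nodes."""
    return node_count * PROCESSORS_PER_NODE


# Node counts as quoted alongside the processor totals in the text.
queue_nodes = {"TitaniumQ": 49, "BlueQ": 26}

for queue, nodes in queue_nodes.items():
    print(queue, nodes, "nodes ->", processors_for(nodes), "processors")
```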
fbs queues shows you the names of the queues and their assigned process types.
fbs hosts shows which worker is which.
To really see the setup you currently have to decode
farms/fbsng_root/cfg/farm.cfg.
We need a tool that automatically summarizes how many workers are assigned to each queue. Generally one would submit to BlueQ and TitaniumQ; the others are only used for testing.
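One possible shape for such a summary is sketched below. It is only an illustration: it tallies nodes per queue from a hand-written table, since the format of farm.cfg is not described here and no parsing is attempted. The table `FARMLETS` and the function `nodes_per_queue` are hypothetical names, and only farmlets whose node counts are stated in this page are listed, so the totals are partial.

```python
from collections import defaultdict

# Hypothetical hand-written table: farmlet -> (node count, queues it serves).
# This is NOT read from farm.cfg. Only farmlets with a stated node count
# appear, so a combined queue such as BLUEQ gets a partial total here.
FARMLETS = {
    "Worker_test": (1, ["PINKQ"]),
    "Worker_1": (9, ["GREENQ"]),
    "Worker_5": (7, ["YELLOWQ", "BLUEQ"]),
    "Worker_test_1": (1, ["DoveQ"]),
}


def nodes_per_queue(farmlets):
    """Sum node counts over every queue each farmlet is assigned to."""
    totals = defaultdict(int)
    for node_count, queues in farmlets.values():
        for queue in queues:
            totals[queue] += node_count
    return dict(totals)


for queue, nodes in sorted(nodes_per_queue(FARMLETS).items()):
    print(queue, nodes)
```

A farmlet that belongs to several queues is counted once in each of them, which matches the way the combined queues overlap the per-farmlet ones.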
Currently the command is:
runrecocert <project> <number of nodes> <version of code> <queue>
Typing runrecocert will dump the full argument list.
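As an illustration of how the four arguments fit together, the sketch below assembles a runrecocert command line as an argument list without running it. The argument order is taken from the list above and should be checked against the usage dump from runrecocert itself; `runrecocert_argv` is a hypothetical helper, and the project and version strings are placeholders, not real names.

```python
def runrecocert_argv(project, n_nodes, code_version, queue):
    """Assemble the runrecocert command line as an argument list."""
    if n_nodes < 1:
        raise ValueError("number of nodes must be at least 1")
    # Argument order follows the list in the text: project, number of nodes,
    # version of code, queue. This order is an assumption to verify.
    return ["runrecocert", project, str(n_nodes), code_version, queue]


# All values below are placeholders, not real project or version names.
argv = runrecocert_argv("my_project", 4, "my_code_version", "TitaniumQ")
print(" ".join(argv))
```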
fbs monitor is a GUI that lets you track jobs (and kill them).
The logs of reco go into
/d0/stripeX/samtest/<jobno>/logs
or
/d0/stripeX/samtest/<jobno>/badinput/logs
where badinput is where jobs that fail end up.
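The two log locations differ only in the badinput component, so they can be built from the job number. A minimal sketch, assuming the layout described above; `log_dirs` is a hypothetical helper, the stripe directory is passed in unchanged (the X in stripeX is left exactly as written here), and the job number in the example is made up.

```python
import posixpath


def log_dirs(stripe_root, jobno):
    """Return the normal and badinput log directories for one job."""
    base = posixpath.join(stripe_root, "samtest", str(jobno))
    return {
        "logs": posixpath.join(base, "logs"),
        # Jobs that fail end up under badinput.
        "badinput_logs": posixpath.join(base, "badinput", "logs"),
    }


# 1234 is a made-up job number; "/d0/stripeX" keeps the placeholder X.
for name, path in log_dirs("/d0/stripeX", 1234).items():
    print(name, path)
```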
Jobs have 3 sections. They communicate by putting a little file <jobno>.CID in the /samjobs directory.
Go to the local log area.
grep STATUS sam_<jobno>.log
will tell you a lot.
Try
grep Failure *jobno*.out
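The same kind of search can be scripted when many job logs need checking. The sketch below is a plain substring filter, roughly what grep does with a fixed string; `grep_lines` is a hypothetical helper, and nothing is assumed about the content of the log lines beyond their containing the search word.

```python
def grep_lines(path, needle):
    """Return the lines of the file at path that contain needle."""
    matches = []
    # errors="replace" keeps the scan going if a log has stray bytes.
    with open(path, errors="replace") as log:
        for line in log:
            if needle in line:
                matches.append(line.rstrip("\n"))
    return matches
```

For example, `grep_lines("sam_1234.log", "STATUS")` would return the STATUS lines of a log for a made-up job number 1234.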
The END section writes a summary of the project; look in the file project_jobno.dump. This file appears when all files have been tried, and it includes a list of the bad ones.
The translation was initiated by Heidi Schellman on 2001-01-09