

Documentation

D0 Production page SAM Batch System

Logging in

Log in to d0mino or another Fermilab machine first, then ssh to d0bbin and the farm nodes. The worker nodes are not directly accessible from outside Fermilab.

The official submission directory can be found in


~d0farm/<version>/farm_machinery/v1_fbsng

where <version> is currently v00-01-00.

To get a new version:


cd
mkdir <version>
cd <version>
cvs checkout -r <version> farm_machinery

Submitting jobs

Make certain the farm batch system is set up

The farm batch system must be set up before a job is submitted:

setup fbsng

What queues are available

The farm is divided into two parts and then into smaller subsections. Each node has one or more color attributes and queues.

The queue TitaniumQ runs on 49 of the 800 MHz duals, for 98 processors in total. The queue BlueQ runs on 26 of the slower 500 MHz duals, for 52 processors. The farm is divided into 10 farmlets and 2 test machines.
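The processor totals follow from each dual node carrying two CPUs. A quick check in shell arithmetic (the helper name procs_for_duals is made up for this illustration and is not part of fbsng):

```shell
# Each "dual" node has two processors, so processors = nodes * 2.
# procs_for_duals is an illustrative helper, not an fbsng command.
procs_for_duals() {
    echo $(( $1 * 2 ))
}

procs_for_duals 49   # TitaniumQ: prints 98
procs_for_duals 26   # BlueQ: prints 52
```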

The farmlets Worker_1 through Worker_5 are old 500 MHz duals.

Worker_test is a single node, fnd01, for debugging. Its queue is PINKQ.
Worker_1 has 9 nodes and is for farm code testing. Its queue is GREENQ.
Worker_2 is currently assigned to David Adams for cft work.
Worker_3 through Worker_5 are ORANGEQ, PURPLEQ and YELLOWQ, and can be combined into queue BLUEQ.
Worker_5 has only 7 nodes, so BLUEQ can have 27 nodes.
All 47 slow nodes are in queue WHITEQ.

The farmlets Worker_6 through Worker_10 are new 800 MHz duals.

Worker_test_1 is fnd051 for testing. Queue is DoveQ
Worker_6 through Worker_10 are VermilionQ, ChartreuseQ, SienaQ, LavenderQ and CadmiumQ.
Worker_7 through Worker_10 are CobaltQ.
Worker_6 through Worker_10 are TitaniumQ, which has 49 fast nodes.

fbs queues shows you the names of the queues and their assigned process types.

fbs hosts shows which worker is which.

To really see the setup you currently have to decode ~farms/fbsng_root/cfg/farm.cfg.

A summary that automatically tells you how many workers are assigned to which queue is still needed. Generally one would submit to BlueQ and TitaniumQ. The others are only used for testing.
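Until such a summary exists, a per-queue node count can be sketched with awk. This is only an illustration: it assumes a hand-prepared list with one "node queue" pair per line, which is not the format of fbs hosts output or of farm.cfg, and the function name count_nodes_per_queue is hypothetical.

```shell
# Count nodes per queue from a two-column "node queue" list on stdin.
# The input format is an assumption of this sketch; it is NOT the
# format of `fbs hosts` output or of farm.cfg.
count_nodes_per_queue() {
    awk 'NF >= 2 { n[$2]++ } END { for (q in n) print q, n[q] }' | sort
}

# Example with made-up input (only fnd01/PINKQ is taken from the text above).
printf '%s\n' 'fnd01 PINKQ' 'fnd02 GREENQ' 'fnd03 GREENQ' | count_nodes_per_queue
```

A node that belongs to several queues would simply appear on several lines, once per queue.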

Submitting the job

Currently the command is:

runrecocert <project> <number of nodes> <version of code> <queue>

Typing runrecocert by itself will dump the full argument list.
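The setup and submission steps can be combined in a small wrapper. This is a sketch only: the function name submit_reco and the DRY_RUN switch are assumptions of this example and are not part of farm_machinery. It checks for the four arguments listed above before calling runrecocert.

```shell
# Illustrative wrapper around the submission steps. submit_reco and
# DRY_RUN are assumptions of this sketch, not part of farm_machinery.
submit_reco() {
    if [ "$#" -ne 4 ]; then
        echo "usage: submit_reco <project> <number of nodes> <version of code> <queue>" >&2
        return 1
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show the command that would be run.
        echo "runrecocert $1 $2 $3 $4"
        return 0
    fi
    setup fbsng                      # the batch system must be set up first
    runrecocert "$1" "$2" "$3" "$4"
}

# Example with placeholder values; prints the command without submitting.
DRY_RUN=1 submit_reco my_project 10 my_version BlueQ
```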

Checking on a job

fbs monitor is a GUI that lets you track jobs (and kill them).

How a job is submitted

runrecocert is currently a script which modifies a 'jdf' file template. You can change this jdf template to send you mail when each job section terminates. The jdf template for the most recent submission is stored in the directory you submitted the job from.

How a job runs

Logs

The logs of the batch scripts go into a local subdirectory MONDD with the date coded in it.

The logs of reco go into /d0/stripeX/samtest/<jobno>/logs or /d0/stripeX/samtest/<jobno>/badinput/logs

where badinput is where jobs that fail end up.

Jobs have 3 sections. They communicate by putting a little file <jobno>.CID in the ~/samjobs directory.

The START section

This starts up the SAM data delivery and then stays alive until 10 hours after everything has finished, trying to store all of the output files. Killing the START section is not a good idea. You can tell it to shut down by removing the <jobno>.CID file.
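Removing the .CID file can be wrapped in a small helper that first checks the file exists. This is a sketch under stated assumptions: the function name stop_start_section and the SAMJOBS_DIR override are hypothetical, and the default location is the ~/samjobs directory mentioned earlier.

```shell
# Ask the START section of a job to shut down by removing its .CID file.
# stop_start_section and SAMJOBS_DIR are illustrative names; the default
# directory is ~/samjobs.
stop_start_section() {
    jobno=$1
    cid="${SAMJOBS_DIR:-$HOME/samjobs}/${jobno}.CID"
    if [ -f "$cid" ]; then
        rm -- "$cid" && echo "removed $cid"
    else
        echo "no such file: $cid" >&2
        return 1
    fi
}

# Usage (job number is a placeholder):
#   stop_start_section 12345
```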

The Worker section

This is the section that starts on all of the worker nodes and runs the actual jobs. It uses SAM for data access.

Looking for failures in the batch/sam system

Go to the local log area.

Tape errors

These show up in sam_<jobno>.log.

grep STATUS sam_<jobno>.log will tell you a lot.

General errors in sam or scripts

Try

grep Failure *jobno*.out
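The two grep checks above can be run together for one job. The helper below is a sketch, with scan_job_logs being a made-up name; it is meant to be run from the local log area and uses only the file names and patterns given above.

```shell
# Scan the current directory for the patterns suggested above.
# scan_job_logs is an illustrative name; run it from the local log area.
scan_job_logs() {
    jobno=$1
    # Tape errors: STATUS lines in the SAM log.
    if [ -f "sam_${jobno}.log" ]; then
        grep STATUS "sam_${jobno}.log" || true
    fi
    # General SAM or script errors: Failure lines in the .out files.
    for f in *"${jobno}"*.out; do
        if [ -f "$f" ]; then
            # The extra /dev/null makes grep print the file name.
            grep Failure "$f" /dev/null || true
        fi
    done
    return 0
}

# Usage (job number is a placeholder):
#   scan_job_logs 12345
```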

Check on processing summary

The END section writes a summary of the project; look in the file project_jobno.dump. This file appears when all files have been tried and includes a list of the bad ones.

About this document ...

Debugging the D0 farms


The translation was initiated by Heidi Schellman on 2001-01-09


Heidi Schellman
2001-01-09