All Common Sample production jobs run under account csgprod, which
exists on both central systems (d0mino0X, d0srvXXX, cab) and clued0.
The above version of the kinit command is attached to the alias
pcron when you log in to csgprod.
If you want to be able to obtain a ticket on a particular clued0 node that
doesn't have a copy of the keytab, simply put a copy of the keytab in the
standard place (note that it is a violation of Fermilab security policy to
put a copy of a keytab on an nfs-shared filesystem). It may be necessary
to run the privileged command kcroninit to create the keytab file,
which you can then overwrite, or have root install the keytab.
In the case of skimming and fixing projects, the input dataset usually
consists of all-stream files produced by the production farm. No
predefined dataset exists for such files, so in these cases it is
necessary to define an input dataset.
Here is the standard way to define an all-stream dataset.
Sam does not allow arbitrary application names and versions. Only
names and versions that have previously been registered in the sam
database can be used. Here are examples of commands to register new
applications and versions (requires sam admin privilege).
Skimming:
Caffing:
Fixing:
Note that the application information is not included in a
human-readable way in metadata files generated by SAMManager.
To examine the application information, it is useful to declare (not store)
a test file to sam.
If you want to use the single pass method, the configuration file Control
should be initialized with a list of subdivided input datasets. There
are scripts to do this, which are listed in the README file.
The recursive method is definitely the preferred way to submit skimming jobs,
although the single pass method is still supported. Single pass
job submission is used for submitting recovery jobs.
Unmerged skimmed files appear in directories $BUFFERS/<skim>.
Each of the above scripts (move_incoming.sh, check_files.sh,
prepare_store.sh, store_all.sh) can accept an optional integer
argument to specify the use of alternate check and data
directories. This allows multiple store processes to run in parallel.
For recent skimming versions, which use four parallel storing processes,
the merging and storing cron scripts have been merged
into a single script.
Here is a typical crontab entry that invokes cron_submit.sh and
cron_merge_store.sh at 15 minute intervals.
When processing large skims (like MUinclusive), it is sometimes
desirable to divide the input dataset into orthogonal pieces
(subskims) which can be processed in parallel. It should be noted that
this use of the term "subskim" has nothing to do with subskims that are
defined for physics purposes. These subskims are purely for
convenience in processing.
The procedure for defining subskims is as follows.
Since caf production uses recursive input datasets, before submitting
jobs using the manual method, make sure that no jobs for the specified
skim are running.
Each of the above scripts (move_incoming_all.sh, check_caf_all.sh,
prepare_store.sh, store_all.sh) can accept an optional integer
argument to specify the use of alternate check and data
directories. This allows multiple store processes to run in parallel.
Each of the above scripts (move_incoming.sh, check_fix.sh,
prepare_store, store_all.sh) can accept an optional integer
argument to specify the use of alternate check and data
directories. This allows multiple store processes to run in parallel.
Most projects include the following script that prints an overview summary
of the project, including batch job and disk status information.
Monitor the status of the batch system using qstat, and especially
find out about jobs owned by csgprod. Use the following kinds of
commands.
The following command gives an overall picture of the cab batch system.
The following scripts check for crashed worker nodes and stuck jobs.
All projects support the following command, which displays a summary of disk
usage in the final buffering stage just before storing in sam.
Projects that have a formal merging step (skimming and some fixing projects)
support the following command, which displays a summary of disk usage
on the disk where unmerged files are stored.
The following command prints a concise summary of sam projects running
on a station.
There are various reasons why sam may not deliver files. There might
simply be too high a load on sam or enstore. Or a tape may be temporarily
inaccessible because it is being used by another process or project. However,
sometimes projects may be stalled due to problems with sam or enstore. In
such cases, sending a message to d0sam-admin@fnal.gov sometimes
helps.
You can get a general feeling for the load and performance of enstore
from the enstore web page:
The best way to monitor sam storing is by watching the sam store process
log files. These log files are located at the following paths, depending
on whether this project has a formal merging step or not.
The "tail -24f" command
can be used to monitor the log files in real time.
Store requests can be monitored via the enstore web page.
Store requests can also be monitored using the command:
If store requests seem to be pending for a long time, it might be because of
high enstore load or inaccessible tapes, both of which can be considered
normal. However, if pending store requests don't show up in the output
from "sam dump fss" or they don't show up in the enstore page,
this is evidence of a problem that requires expert intervention (send
e-mail to d0sam-admin@fnal.gov).
The following are examples of log files that appear in the home directory
of a production project, and which should be examined from time to time.
File stores that are interrupted by stager node or station reboots can
sometimes leave the file in a "limbo" state where it can't be stored
due to being in /pnfs but not having a location known to sam. Such partial
stores can sometimes be fixed using script check_store.sh. To
use check_store.sh do the following.
For any kind of production project, auditing should be done when the project
is stopped, meaning the following.
The actual procedures for auditing projects are different for different kinds
of projects.
Sept. 9, 2010
Logging in to csgprod
The csgprod account does not have a (known) password. To log in to
account csgprod, you should arrange to have your kerberos
principal added to csgprod's .k5login files (there are
two, one for clued0 and one for the central systems).
Obtaining a ticket using keytab
Since csgprod does not have a password, you cannot obtain a
ticket simply by typing kinit and entering a password.
However, csgprod does
have a keytab file, which was originally made on d0mino01, for the
special principal csgprod/cron/d0mino01.fnal.gov@FNAL.GOV. This
keytab exists in the standard place on d0mino01, namely,
"/var/adm/krb5/`kcron -f`". This keytab has been copied to selected
clued0 nodes, in the same standard place (note that the filenames are
not the same because "kcron -f" returns a hash of the node name and user name, which
is different on different nodes).
On any node where a copy of the keytab exists,
you can obtain a ticket for the special principal using the following
command:
kinit -f -k -t /var/adm/krb5/`kcron -f` csgprod/cron/d0mino01.fnal.gov
The above command is a slight variation of the command contained in
the standard script kcron. Kcron itself doesn't
work because it assumes that the node name embedded in the special
principal matches the node name on which kcron is executed,
which is not the case here (except on d0mino01). The standard command
kbatch works in the context of cab batch jobs.
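For use inside scripts, the same kinit command can be wrapped in a Bourne-shell function that first checks whether the keytab is present on the current node. This is only a sketch: the function name get_csgprod_ticket and the KEYTAB_DIR override are hypothetical and are not part of the csgprod login setup, and the real pcron alias may be defined differently.

```shell
#!/bin/sh
# Sketch of a pcron-like helper (assumption: not the actual csgprod alias).
# KEYTAB_DIR is a hypothetical override; the standard place is /var/adm/krb5.
get_csgprod_ticket() {
    keytab="${KEYTAB_DIR:-/var/adm/krb5}/$(kcron -f)"
    # The keytab exists only on d0mino01 and selected clued0 nodes.
    if [ ! -r "$keytab" ]; then
        echo "get_csgprod_ticket: no readable keytab at $keytab" >&2
        return 1
    fi
    # Same options and principal as the kinit command shown above.
    kinit -f -k -t "$keytab" csgprod/cron/d0mino01.fnal.gov
}
```

The function returns a non-zero status when no keytab is found, so a cron script can stop before submitting jobs without a ticket.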
Cvs package csgprod
Files used by the CSG production system are stored in D0 cvs package
csgprod. Files include scripts and patches (data and code) for an
existing frozen production release.
Structure of cvs package csgprod
To get access to the cvs repository, you need to get a kerberos ticket
using your own personal principal (the csgprod special principal does
not work for getting access to cvs).
Making a new production project
Defining input dataset
In the case of a caffing project, the input datasets are simply the output
datasets of a skimming project, which normally would already exist, so no
extra step would be required in this case.
sam create definition --defname=<input-dataset> \
--dim="APPL_NAME d0reco \
and VERSION <d0reco-version> \
and DATA_TIER thumbnail \
and TRIG_CONFIG_TYPE physics \
and PHYSICAL_DATASTREAM_NAME all% \
and RUN_NUMBER <runmin>-<runmax>"
In this example, <d0reco-version> is simply a production release
version, such as p20.16.08. The run range is optional. A run range
would typically be used for a fixing project where only a subset of the
data are being fixed.
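Because the run range is optional, it can be convenient to build the dimension string with a small helper that appends the RUN_NUMBER clause only when a range is supplied. The following is a minimal sketch in Bourne shell; the function name allstream_dims is hypothetical, and the dimension keywords are copied from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the dimension string for an all-stream dataset.
# Usage: allstream_dims <d0reco-version> [<runmin> <runmax>]
allstream_dims() {
    version=$1
    dims="APPL_NAME d0reco and VERSION $version and DATA_TIER thumbnail"
    dims="$dims and TRIG_CONFIG_TYPE physics and PHYSICAL_DATASTREAM_NAME all%"
    # The run range is optional; add it only when both limits are given.
    if [ $# -ge 3 ]; then
        dims="$dims and RUN_NUMBER $2-$3"
    fi
    printf '%s\n' "$dims"
}

# Example (dataset name and run numbers are made up for illustration):
#   sam create definition --defname=my-allstream-test \
#       --dim="$(allstream_dims p20.16.08 200000 200100)"
```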
Making a new branch tag
The first step in making a new production project is creating a new
branch tag in one of the three subdirectories (skimprod, cafprod,
fixprod), depending on the type of project. The steps for doing
this are as follows.
Note that we use "cvs tag" instead of "cvs rtag" because we only want
to tag one subdirectory, not the whole package. Also note that these
steps can be done from any account. Here is an example showing
specific cvs commands.
cvs co csgprod
cd csgprod/skimprod
cvs update -r <existing-branch-tag>
cvs tag -r <existing-branch-tag> <branch-point>
cvs tag -b -r <branch-point> <branch-tag>
cd ..
emacs TAGS &
cvs commit -m "Update TAGS."
Making a new working directory
Check out the newly created branch tag in a new directory, which will become
the working directory for the new production project. Only check out
one of the three subdirectories (skimprod, cafprod, fixprod).
For example,
cd /prj_root/5007/csg/csgprod/work
cvs co -d <directory> -r <branch-tag> csgprod/skimprod
Registering a new sam application version
Every production project corresponds to a unique sam application name
and version. The sam application/version will be included in the sam
metadata of every file produced by the project. The sam
application/version is the main way files produced by a project are
identified and extracted from sam later.
samadmin add application family --appFamily=tmbskim --appName=tmbskim --appVersion=<version>
samadmin add application family --appFamily=treemaker --appName=tmb_analyze --appVersion=<version>
samadmin add application family --appFamily=tmbfixer --appName=tmbfixer --appVersion=<version>
The above --appFamily and --appName arguments are standard.
Only the --appVersion argument should change.
Configuring a project
Each type of project (skimming, caffing, fixing) has its own peculiarities,
but there are also some commonalities. The following kinds of configuration
apply to all projects.
Test project configuration
Here are some verification tests that I have found to be useful.
compdir <old-working-dir> <new-working-dir>
compdir <release-dir> <working-dir> <package>
The significance of the third argument of compdir is that it
only compares that one subdirectory (as opposed to the entire contents).
For example:
compdir /D0/dist/releases/p21.18.00 /prj_root/5007/csg/csgprod/work/skimprod-p21.18.00-p20.16.08 np_tmb_stream
cron_submit.sh
cron_merge.sh
cron_store.sh
The exact names of the cron scripts might vary (e.g. not all projects
have a formal merging step).
sam declare file <metadata-file>
sam get metadata --file=<filename>
sam undeclare file <filename>
Defining output datasets
Update CSG web pages
Instructions for making specific kinds of projects
Each type of project contains one or more README* files in its
subdirectory, customized for a particular branch-tag.
Skimming
Overview
Skimming involves the following three steps.
Skimming executable and rcp
Here are some essential data about the skimming program.
Preparation
Follow general instructions for configuring projects, with the following
additional specific instructions.
mkdir -p $SCRATCH # Scratch area.
mkdir -p $BUFFERS # For unmerged files.
mkdir -p $MERGED # For merged files.
get_skims.sh -d > SKIMS
mkdir $MERGED/incoming
mkdir $MERGED/check
mkdir $MERGED/check2
mkdir $MERGED/check3
mkdir $MERGED/check4
mkdir $MERGED/data
mkdir $MERGED/data2
mkdir $MERGED/data3
mkdir $MERGED/data4
mkdir $MERGED/store
The number of check and data subdirectories needed
depends on the number and contents of STORE_SKIMS* configuration
files.
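When the number of store streams changes, the directory layout above can be generated with a loop instead of being typed by hand. The sketch below assumes the naming pattern visible in the mkdir list (check, check2, ..., data, data2, ...); the function name make_store_dirs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create the buffering directories for N store streams.
# Usage: make_store_dirs <base-directory> <number-of-streams>
make_store_dirs() {
    base=$1
    nstreams=$2
    mkdir -p "$base/incoming" "$base/store"
    i=1
    while [ "$i" -le "$nstreams" ]; do
        # Stream 1 uses the unsuffixed names; streams 2..N get a numeric suffix.
        if [ "$i" -eq 1 ]; then suffix=""; else suffix=$i; fi
        mkdir -p "$base/check$suffix" "$base/data$suffix"
        i=$((i + 1))
    done
}

# Example: make_store_dirs "$MERGED" 4
```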
mkdefs.sh
Skimming job submission
Normally, cron_submit.sh will be edited to submit jobs using
one of the two methods (recursive or single pass).
cron_submit.sh
pcron # Get ticket.
./runTMBStream_recursive.sh
pcron # Get ticket.
./runTMBStream.sh
Merging
Merged files (all skims) appear in directory
$MERGED/incoming.
cron_merge.sh
listSkimmed.sh <skim> # Update .parents file
check_unmerged.sh <skim> # Check for already merged files.
remove_merged.sh
pcron # Get ticket.
runMerge.sh <skim> # Submit batch job.
Storing files in sam
Storing files does not involve submitting batch jobs. Rather, storing
involves running a background process on a stager node.
cron_store.sh
cron_store2.sh
cron_store3.sh
cron_store4.sh
move_incoming.sh # incoming->check
check_files.sh # check->data
prepare_store.sh
pcron # Get ticket.
ssh <stager-node> "cd $MERGED/store; ./store_all.sh"
Stager nodes are listed in the defs configuration file.
Skimming crontab
Skimming involves three steps (skimming, merging, storing). It is certainly
possible to construct a crontab that invokes separate cron scripts for
each step.
cron_merge_store.sh
This script actually invokes merging and storing scripts sequentially.
cron_merge.sh
cron_store.sh
cron_store2.sh
cron_store3.sh
cron_store4.sh
This way of invoking merging and storing avoids certain undesirable race
conditions that can occur if merging and multiple storing scripts are
invoked independently.
10,25,40,55 * * * * cd /prj_root/5007/csg/csgprod/work/skimprod-p21.18.00-p20.16.08;./cron_submit.sh
2,17,32,47 * * * * cd /prj_root/5007/csg/csgprod/work/skimprod-p21.18.00-p20.16.08;./cron_merge_store.sh
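The minute fields in these entries are simply four values spaced 15 minutes apart, starting from a different offset for each script so that the scripts do not start at the same moment. A minimal sketch for generating such a field follows; the function name cron_minutes is hypothetical, and the offset is assumed to lie between 0 and 14.

```shell
#!/bin/sh
# Hypothetical helper: print a crontab minute field with 15-minute spacing.
# Usage: cron_minutes <offset>   (offset assumed to be in the range 0-14)
cron_minutes() {
    offset=$1
    list=""
    m=$offset
    while [ "$m" -lt 60 ]; do
        # Append with a comma separator, except for the first value.
        list="${list:+$list,}$m"
        m=$((m + 15))
    done
    printf '%s\n' "$list"
}

cron_minutes 10   # prints 10,25,40,55
cron_minutes 2    # prints 2,17,32,47
```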
Caffing
Overview
Caffing involves two steps.
Caffing executable and rcp
Here are some essential data about the caffing program.
Preparation
Follow general instructions for configuring projects, with the following
additional specific instructions.
You can change the output file name, but the output file name stored
in TMBTreePkg*.rcp has to match environment variables
FILE_PREFIX and FILE_SUFFIX defined in the file defs.
mkdir -p $SCRATCH # Scratch area.
mkdir -p $OUTPUT_DIR # For output files.
mkdir -p $OUTPUT_DIR/incoming
mkdir -p $OUTPUT_DIR/check
mkdir -p $OUTPUT_DIR/check2
mkdir -p $OUTPUT_DIR/data
mkdir -p $OUTPUT_DIR/data2
mkdir -p $OUTPUT_DIR/store
The number of check and data subdirectories needed
depends on the number and contents of STORE_SKIMS* configuration
files.
mkdefs_all.sh
setenv SKIM <skim>
mksubdefs.sh
Batch job submission
Caf format output files appear in directory $OUTPUT_DIR/incoming (all
skims).
cron_submit.sh
pcron # Get ticket.
setenv SKIM <skim>
setenv JOBS <number-workers>
./runTMBAnalyze.sh
Storing files in sam
Storing files does not involve submitting batch jobs. Rather, storing
involves running a background process on a stager node.
cron_store.sh
cron_store2.sh
move_incoming_all.sh # incoming->check
check_caf_all.sh # check->data
prepare_store.sh
pcron # Get ticket.
ssh <stager-node> "cd $OUTPUT_DIR/store; ./store_all.sh"
Stager nodes are listed in the defs configuration file.
Caffing crontab
Here is a typical crontab entry that invokes cron_submit.sh,
cron_store.sh, and cron_store2.sh at 15 minute intervals.
8,23,38,53 * * * * cd /prj_root/5007/csg/csgprod/work/cafprod-p21.18.00-p20.16.08;./cron_submit.sh
3,18,33,48 * * * * cd /prj_root/5007/csg/csgprod/work/cafprod-p21.18.00-p20.16.08;./cron_store.sh
13,28,43,58 * * * * cd /prj_root/5007/csg/csgprod/work/cafprod-p21.18.00-p20.16.08;./cron_store2.sh
Fixing
Overview
Fixing involves the following two steps.
Fixing usually does not involve filtering of input events. Therefore,
output files are the same size as input files and no merging step is
needed.
Fixing executable and rcp
Here are some essential data about the fixing program.
Preparation
Follow general instructions for configuring projects, with the following
additional specific instructions.
mkdir -p $SCRATCH # Scratch area.
mkdir -p $OUTPUT_DIR # For fixed files.
mkdir $OUTPUT_DIR/incoming
mkdir $OUTPUT_DIR/check
mkdir $OUTPUT_DIR/check2
mkdir $OUTPUT_DIR/data
mkdir $OUTPUT_DIR/data2
mkdir $OUTPUT_DIR/store
Fixing job submission
The single pass job submission method is always used for fixing. The
reason for this is that the single pass method allows multiple sam
projects to be running in parallel. Since fixing is more compute-intensive
than either skimming or caffing, it is desirable to keep the maximum number
of jobs running on cab at all times.
Output fixed files appear in directory $OUTPUT_DIR/incoming.
cron_submit.sh
pcron # Get ticket.
./runTMBFixer.sh
Storing files in sam
Storing files does not involve submitting batch jobs. Rather, storing
involves running a background process on a stager node.
cron_store.sh
move_incoming.sh # incoming->check
check_fix.sh # check->data
prepare_store
pcron # Get ticket.
ssh <stager-node> "cd $OUTPUT_DIR/store; ./store_all.sh"
Stager nodes are listed in the defs configuration file.
Fixing crontab
Here is a typical crontab entry that invokes cron_submit.sh,
cron_store.sh, and cron_store2.sh at 15 minute intervals.
6,21,36,51 * * * * cd /prj_root/5007/csg/csgprod/work/fixprod-p20.18.02b;./cron_submit.sh
12,27,42,57 * * * * cd /prj_root/5007/csg/csgprod/work/fixprod-p20.18.02b;./cron_store.sh
2,17,32,47 * * * * cd /prj_root/5007/csg/csgprod/work/fixprod-p20.18.02b;./cron_store2.sh
Monitoring production projects
Most of the time production jobs run without human intervention under the
control of the various cron production scripts. The production scripts have
some ability to recover from certain common failures and errors. Other kinds
of failures require intervention by the user managing the production project,
or by outside support (d0sam-admin@fnal.gov or service desk).
Common problems
Some common failures that can cause production to stall are listed below.
Checking project status
Here is a list of checks that you should make frequently (e.g.
on a daily basis). Some of the checks make use of specially written
scripts. Some simply involve using standard system commands.
skim_status.sh
qstat @d0cabsrv1 -u csgprod
qstat @d0cabsrv1 | grep csgprod
Make sure that number of batch jobs is reasonable. Check that the clock
and cpu times are reasonable.
cab_status.sh
monitor.sh
monitor.sh d0cabsrv1
monitor.sh d0cabsrv2
If you find a batch job that is dead because it is running on a crashed
cab node, or can't make progress for some other reason, purge it from
the batch system using the purge_job command (runs on
d0mino04, d0mino05, and d0mino06).
purge_job 12345.d0cabsrv1
purge_job 12345.d0cabsrv2
check_output.sh
check_buffers.sh
A stalled project may manifest itself by either of the above
commands reporting disks that are unusually empty or unusually full.
project_status_all.sh fnal-cabsrv1
project_status_all.sh fnal-cabsrv2
The above script is especially useful for spotting stalled projects that
haven't had files delivered in a long time.
http://www-d0en.fnal.gov/enstore/status_enstore_system.html#D0-LTO4F1.library_manager
cd $OUTPUT_DIR/store # For projects without a merging step.
cd $MERGED/store # For projects with a merging step.
tail -24f log.store # Store stream 1 log file
tail -24f log2.store # Store stream 2 log file
tail -24f log3.store # Store stream 3 log file
tail -24f log4.store # Store stream 4 log file
Output is generated from store processes to these log files
at least once per minute. In
normal conditions, you should see that no store requests are pending,
or you should see store requests succeeding.
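Since the store processes write to their logs at least once per minute, a store log that has not changed for several minutes may indicate a stalled or dead store process. The following sketch lists such logs. It assumes a find that supports the -maxdepth and -mmin options (as GNU find does); the function name stale_logs and the five-minute default are hypothetical choices, not part of the production scripts.

```shell
#!/bin/sh
# Hypothetical helper: list store logs not modified in the last N minutes.
# Usage: stale_logs <store-directory> [<minutes>]   (default: 5 minutes)
stale_logs() {
    dir=$1
    minutes=${2:-5}
    # -mmin +N matches files last modified more than N minutes ago.
    find "$dir" -maxdepth 1 -name 'log*.store' -mmin "+$minutes"
}

# Example: stale_logs "$MERGED/store" 10
```

No output means that every store log in the directory was updated recently.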
http://www-d0en.fnal.gov/enstore/status_enstore_system.html#D0-LTO4F1.library_manager
sam dump fss --station=fnal-cabsrv2
Note that the argument "--station=fnal-cabsrv2" is not just an
example. Store requests are always made from the fnal-cabsrv2
sam station.
submit.log
merge.log
store.log
store2.log
store3.log
store4.log
For example, if jobs aren't being submitted, submit.log will
usually give a clue to the reason (e.g. not enough disk space, previously
submitted jobs not finished, etc.).
Things to check after a downtime
Problems of all kinds are more common after downtimes. Here are some
common problems that are especially likely after downtimes and remedies.
Auditing and error recovery
For any kind of production project, it will be necessary to verify that all
files from the input dataset(s) have been processed and the results stored
in sam. This must be done at least when a project is ending. It may be
beneficial to do intermediate audits from time to time if a project has
been running for a long time.
Auditing skimming projects
Here is the procedure for auditing skimming projects.
./check_all.sh >& check_all.log
./define_makeup.sh all
This script defines datasets for the missing files, and as a side effect
makes files containing lists of missing files by skim, called
<skim>.missing, and all.missing which is the union
of missing files from all skims. Newly defined datasets for missing
files are appended to configuration file Control. You can display
the newly created datasets using the following script.
./check_control.sh
pcron # Get ticket.
./runTMBStream.sh
If there are too many missing files to submit all at once, you can
let them be submitted automatically by cron using cron_submit.sh.
Make sure cron_submit.sh invokes the single-pass
job submission script runTMBStream.sh rather than the
recursive job submission script runTMBStream_recursive.sh (edit
cron_submit.sh if necessary).
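To judge whether the missing files can be submitted all at once, it may help to count the entries in each <skim>.missing list first. The sketch below assumes that these lists contain one file name per line, which is not confirmed here; the function name count_missing is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<skim> <count>" for every *.missing list.
# Assumption: each list holds one file name per line.
# Usage: count_missing <working-directory>
count_missing() {
    dir=$1
    for f in "$dir"/*.missing; do
        # Skip the unexpanded pattern when no lists exist.
        [ -f "$f" ] || continue
        name=$(basename "$f" .missing)
        # Count non-empty lines; grep exits non-zero when the count is 0.
        n=$(grep -c . "$f" || true)
        printf '%s %s\n' "$name" "$n"
    done
}

# Example: count_missing .
```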
Auditing caffing projects
Here is the procedure for auditing caffing projects.
./check_datasets_all.sh
Or simply wait until no new jobs have been submitted by cron for a long
time.
./check_skims.sh >& check_skims.log
setenv SKIM <skim> # Skim
setenv JOBS <n> # Number of jobs
./define_makeup.sh # Define makeup dataset.
./runTMBAnalyze.sh # Submit jobs.
Choose the number of jobs depending on the number of missing files.
It is usually good to set the number of jobs to be about half of the number
of missing files, up to a maximum of 120.
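The rule of thumb above can be written as a short calculation. This is a sketch only: the function name makeup_jobs is hypothetical, and rounding half of an odd count upward is an assumption made so that a single missing file still gets one job.

```shell
#!/bin/sh
# Hypothetical helper: suggest a job count for a makeup submission.
# Rule of thumb: about half the number of missing files, at most 120.
# Usage: makeup_jobs <number-of-missing-files>
makeup_jobs() {
    nmissing=$1
    # Integer division, rounded up (assumption; see the text above).
    njobs=$(( (nmissing + 1) / 2 ))
    if [ "$njobs" -gt 120 ]; then
        njobs=120
    fi
    printf '%s\n' "$njobs"
}

makeup_jobs 50     # prints 25
makeup_jobs 1000   # prints 120
```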
Auditing fixing projects
Here is the procedure for auditing fixing projects.
./check.sh >& check.log
./define_makeup.sh all
Newly defined datasets for missing
files are appended to configuration file Control. You can display
the newly created datasets using the following script.
./check_control.sh
pcron # Get ticket.
./runTMBFixer.sh
If there are too many missing files to submit all at once, you can
let them be submitted automatically by cron using cron_submit.sh.