Common Sample Group Production System

Contents

  1. Production account csgprod.
    1. Logging in to csgprod.
    2. Obtaining a ticket using keytab.
  2. Cvs package csgprod.
    1. Structure of cvs package csgprod.
  3. Making a new production project.
    1. Defining input dataset.
    2. Making a new branch tag.
    3. Making a new working directory.
    4. Registering a new sam application version.
    5. Configuring a project.
    6. Test project configuration.
    7. Defining output datasets.
    8. Update CSG web pages.
  4. Instructions for making specific kinds of projects.
    1. Skimming.
      1. Overview.
      2. Skimming executable and rcp.
      3. Preparation.
      4. Skimming job submission.
      5. Merging.
      6. Storing files in sam.
      7. Skimming crontab.
    2. Caffing.
      1. Overview.
      2. Caffing executable and rcp.
      3. Preparation.
      4. Batch job submission.
      5. Storing files in sam.
      6. Caffing crontab.
    3. Fixing.
      1. Overview.
      2. Fixing executable and rcp.
      3. Preparation.
      4. Fixing job submission.
      5. Storing files in sam.
      6. Fixing crontab.
  5. Monitoring production projects.
    1. Common problems.
    2. Checking project status.
    3. Things to check after a downtime.
  6. Auditing and error recovery.
    1. Auditing skimming projects.
    2. Auditing caffing projects.
    3. Auditing fixing projects.

Production account csgprod

All Common Sample production jobs run under account csgprod, which exists on both central systems (d0mino0X, d0srvXXX, cab) and clued0.

Logging in to csgprod

The csgprod account does not have a (known) password. To log in to csgprod, arrange to have your kerberos principal added to csgprod's .k5login files (there are two, one for clued0 and one for central systems).

Obtaining a ticket using keytab

Since csgprod does not have a password, you cannot obtain a ticket simply by typing kinit and entering a password. However, csgprod does have a keytab file, which was originally made on d0mino01 for the special principal csgprod/cron/d0mino01.fnal.gov@FNAL.GOV. This keytab exists in the standard place on d0mino01, namely "/var/adm/krb5/`kcron -f`". The keytab has been copied to selected clued0 nodes, in the same standard place (note that the filenames are not the same, because "kcron -f" returns a hash of the node name and user name, which differs from node to node). On any node where a copy of the keytab exists, you can obtain a ticket for the special principal using the following command:
kinit -f -k -t /var/adm/krb5/`kcron -f` csgprod/cron/d0mino01.fnal.gov
The above command is a slight variation of the command contained in the standard script kcron. Kcron itself doesn't work because it assumes that the node name embedded in the special principal matches the node name on which kcron is executed, which is not the case here (except on d0mino01). The standard command kbatch works in the context of cab batch jobs.

The above version of the kinit command is attached to the alias pcron when you log in to csgprod.
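
For scripted use, the same command can be assembled programmatically. The following Python sketch is an illustration only: the function name kinit_argv is hypothetical and not part of package csgprod, and the keytab path is passed in by the caller rather than computed with "kcron -f".

```python
# Special principal quoted in the text above.
PRINCIPAL = "csgprod/cron/d0mino01.fnal.gov"


def kinit_argv(keytab_path, principal=PRINCIPAL):
    """Return the argument list for the kinit command shown above.

    keytab_path is whatever "/var/adm/krb5/`kcron -f`" expands to on the
    local node; this sketch does not try to compute it.
    """
    return ["kinit", "-f", "-k", "-t", keytab_path, principal]


# "EXAMPLE_KEYTAB" is a placeholder, not a real keytab file name.
command_line = " ".join(kinit_argv("/var/adm/krb5/EXAMPLE_KEYTAB"))
print(command_line)
```

A list of arguments (rather than one string) can be handed directly to subprocess.run without going through a shell.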

If you want to be able to obtain a ticket on a particular clued0 node that doesn't have a copy of the keytab, simply put a copy of the keytab in the standard place (note that it is a violation of Fermilab security policy to put a copy of a keytab on an nfs-shared filesystem). It may be necessary to run the privileged command kcroninit to create the keytab file, which you can then overwrite, or have root install the keytab.

Cvs package csgprod

Files used by the CSG production system are stored in D0 cvs package csgprod. These include scripts and patches (data and code) for an existing frozen production release.

Structure of cvs package csgprod

To get access to the cvs repository, you need to get a kerberos ticket using your own personal principal (the csgprod special principal does not work for getting access to cvs).

Making a new production project

Defining input dataset

In the case of a caffing project, the input datasets are simply the output datasets of a skimming project, which normally would already exist, so no extra step would be required in this case.

In the case of skimming and fixing projects, the input dataset usually consists of all-stream files produced by the production farm. No predefined dataset exists for such files, so in these cases it is necessary to define an input dataset.

Here is the standard way to define an all-stream dataset.

sam create definition --defname=<input-dataset> \
  --dim="APPL_NAME d0reco \
         and VERSION <d0reco-version> \
         and DATA_TIER thumbnail \
         and TRIG_CONFIG_TYPE physics \
         and PHYSICAL_DATASTREAM_NAME all% \
         and RUN_NUMBER <runmin>-<runmax>"
In this example, <d0reco-version> is simply a production release version, such as p20.16.08. The run range is optional. A run range would typically be used for a fixing project where only a subset of the data are being fixed.
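
Because the run range is optional, it can help to build the dimension string from its parts. The Python sketch below is an illustration under that assumption; the function name all_stream_dimensions is hypothetical, and the run numbers in the usage lines are made-up values, not a real run range.

```python
def all_stream_dimensions(d0reco_version, run_min=None, run_max=None):
    """Build the --dim string for an all-stream dataset definition.

    The run range is appended only if both limits are given.
    """
    if (run_min is None) != (run_max is None):
        raise ValueError("give both run_min and run_max, or neither")
    terms = [
        "APPL_NAME d0reco",
        f"VERSION {d0reco_version}",
        "DATA_TIER thumbnail",
        "TRIG_CONFIG_TYPE physics",
        "PHYSICAL_DATASTREAM_NAME all%",
    ]
    if run_min is not None:
        terms.append(f"RUN_NUMBER {run_min}-{run_max}")
    return " and ".join(terms)


# Whole d0reco version (typical for skimming).
print(all_stream_dimensions("p20.16.08"))
# Restricted run range (typical for fixing); 1000 and 2000 are placeholders.
print(all_stream_dimensions("p20.16.08", 1000, 2000))
```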

Making a new branch tag

The first step in making a new production project is to create a new branch tag in one of the three subdirectories (skimprod, cafprod, fixprod), depending on the type of project. Note that we use "cvs tag" instead of "cvs rtag" because we only want to tag one subdirectory, not the whole package. Also note that these steps can be done from any account. Here is an example showing the specific cvs commands.
cvs co csgprod
cd csgprod/skimprod
cvs update -r <existing-branch-tag>
cvs tag -r <existing-branch-tag> <branch-point>
cvs tag -b -r <branch-point> <branch-tag>
cd ..
emacs TAGS &
cvs commit -m "Update TAGS."

Making a new working directory

Check out the newly created branch tag in a new directory, which will become the working directory for the new production project. Only check out one of the three subdirectories (skimprod, cafprod, fixprod). For example,
cd /prj_root/5007/csg/csgprod/work
cvs co -d <directory> -r <branch-tag> csgprod/skimprod

Registering a new sam application version

Every production project corresponds to a unique sam application name and version. The sam application/version will be included in the sam metadata of every file produced by the project. The sam application/version is the main way files produced by a project are identified and extracted from sam later.

Sam does not allow arbitrary application names and versions. Only names and versions that have previously been registered in the sam database can be used. Here are examples of commands to register new applications and versions (requires sam admin privilege).

Skimming:

samadmin add application family --appFamily=tmbskim --appName=tmbskim --appVersion=<version>

Caffing:

samadmin add application family --appFamily=treemaker --appName=tmb_analyze --appVersion=<version>

Fixing:

samadmin add application family --appFamily=tmbfixer --appName=tmbfixer --appVersion=<version>
The above --appFamily and --appName arguments are standard. Only the --appVersion argument should change.
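
The fixed family/name pairs can be kept in a small lookup table so that only the version has to be supplied. This Python sketch is illustrative; the table and function names are hypothetical, and the version string in the usage line is a placeholder.

```python
# Standard (appFamily, appName) pairs taken from the commands above.
APPLICATIONS = {
    "skimming": ("tmbskim", "tmbskim"),
    "caffing": ("treemaker", "tmb_analyze"),
    "fixing": ("tmbfixer", "tmbfixer"),
}


def register_command(project_type, version):
    """Return the samadmin argument list for registering a new version."""
    family, name = APPLICATIONS[project_type]
    return [
        "samadmin", "add", "application", "family",
        f"--appFamily={family}",
        f"--appName={name}",
        f"--appVersion={version}",
    ]


# "EXAMPLE_VERSION" is a placeholder for the real application version.
print(" ".join(register_command("caffing", "EXAMPLE_VERSION")))
```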

Configuring a project

Each type of project (skimming, caffing, fixing) has its own peculiarities, but there are also some commonalities. The following kinds of configuration apply to all projects.

Test project configuration

Here are some verification tests that I have found to be useful.

Defining output datasets

Update CSG web pages

Instructions for making specific kinds of projects

Each type of project contains one or more README* files in its subdirectory, customized for a particular branch-tag.

Skimming

Overview

Skimming involves the following three steps.
  1. Skimming proper. Batch jobs running on cab read all-stream files from sam and write multiple output streams to disk.
  2. Merging. Skimmed files are merged to a target size of 1 Gbyte per file. Files are read from disk and saved to a different disk.
  3. Storing in sam. Merged files are stored in sam.
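
To make the idea of merging to a target size concrete, here is a minimal Python sketch of one possible grouping strategy. It is not the algorithm used by the production merging scripts, which is not described here; it assumes a simple greedy rule (close a group when the next file would push it past the target) and takes "1 Gbyte" to mean 10**9 bytes. All names are hypothetical.

```python
TARGET_BYTES = 10**9  # assumed meaning of the "1 Gbyte" target


def group_for_merge(files, target=TARGET_BYTES):
    """Group (name, size_in_bytes) pairs, in order, into merge groups.

    A group is closed as soon as adding the next file would exceed the
    target. A single file larger than the target forms its own group.
    """
    groups = []
    current, current_size = [], 0
    for name, size in files:
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# File names and sizes are made up for illustration.
unmerged = [("a", 400_000_000), ("b", 400_000_000), ("c", 400_000_000)]
print(group_for_merge(unmerged))
```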

Skimming executable and rcp

Here are some essential data about the skimming program.

Preparation

Follow general instructions for configuring projects, with the following additional specific instructions.

Skimming job submission

Normally, cron_submit.sh will be edited to submit jobs using one of the two methods (recursive or single pass).

If you want to use the single pass method, configuration file Control should be initialized with a list of subdivided input datasets. There are scripts to do this, which are listed in the README file.

The recursive method is definitely the preferred way to submit skimming jobs, although the single pass method is still supported. Single pass job submission is used for submitting recovery jobs.

Unmerged skimmed files appear in directories $BUFFERS/<skim>.

Merging

Merged files (all skims) appear in directory $MERGED/incoming.

Storing files in sam

Storing files does not involve submitting batch jobs. Rather, storing involves running a background process on a stager node.

Skimming crontab

Skimming involves three steps (skimming, merging, storing). It is certainly possible to construct a crontab that invokes separate cron scripts for each step.

For recent skimming versions, which use four parallel storing processes, the merging and storing cron scripts have been merged into a single script:

cron_merge_store.sh
This script invokes the merging and storing scripts sequentially:
cron_merge.sh
cron_store.sh
cron_store2.sh
cron_store3.sh
cron_store4.sh
This way of invoking merging and storing avoids certain undesirable race conditions that can occur if merging and multiple storing scripts are invoked independently.

Here is a typical crontab entry that invokes cron_submit.sh and cron_merge_store.sh at 15 minute intervals.

10,25,40,55 * * * * cd /prj_root/5007/csg/csgprod/work/skimprod-p21.18.00-p20.16.08;./cron_submit.sh
2,17,32,47 * * * * cd /prj_root/5007/csg/csgprod/work/skimprod-p21.18.00-p20.16.08;./cron_merge_store.sh
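
The minute fields in these entries are just a starting offset repeated every 15 minutes, with a different offset per script so that the scripts do not start at the same moment. The Python sketch below generates such lines; the function names are hypothetical and the sketch only reproduces the format of the entries shown above.

```python
def quarter_hour_minutes(offset, interval=15):
    """Return the crontab minute field for a given starting offset."""
    if not 0 <= offset < interval:
        raise ValueError("offset must be in the range 0..interval-1")
    return ",".join(str(minute) for minute in range(offset, 60, interval))


def crontab_line(offset, workdir, script):
    """Format one crontab entry in the style used above."""
    return f"{quarter_hour_minutes(offset)} * * * * cd {workdir};./{script}"


workdir = "/prj_root/5007/csg/csgprod/work/skimprod-p21.18.00-p20.16.08"
print(crontab_line(10, workdir, "cron_submit.sh"))
print(crontab_line(2, workdir, "cron_merge_store.sh"))
```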

Caffing

Overview

Caffing involves two steps.
  1. Caffing proper. Batch jobs running on cab read tmb format skimmed data from sam and write a single caf format output stream to disk. Thirteen tmb format skims are processed into 14 caf format output skims (the MUinclusive tmb skim is processed into both MUinclusive and MUhigh caf skims). There is no formal merging step, but approximately two input files are processed to produce one output file to keep the output file size near 1 Gbyte.
  2. Storing files in sam.
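
The relation between tmb skims and caf skims can be written down as a small lookup. Only the names MUinclusive and MUhigh come from the text above; the other skim names are not listed here, so the usage lines use placeholder names, and the function names are hypothetical.

```python
# Only this split is stated in the text. Every other tmb skim is assumed
# to produce a single caf skim of the same name.
SPLIT_SKIMS = {"MUinclusive": ["MUinclusive", "MUhigh"]}


def caf_skims_for(tmb_skim):
    """Return the caf skim names produced from one tmb skim."""
    return list(SPLIT_SKIMS.get(tmb_skim, [tmb_skim]))


def count_caf_skims(tmb_skims):
    """Count the caf output skims produced from a list of tmb skims."""
    return sum(len(caf_skims_for(skim)) for skim in tmb_skims)


# Twelve placeholder names plus MUinclusive stand in for the 13 tmb skims.
tmb_skims = [f"PLACEHOLDER{i}" for i in range(12)] + ["MUinclusive"]
print(count_caf_skims(tmb_skims))
```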

Caffing executable and rcp

Here are some essential data about the caffing program.

Preparation

Follow general instructions for configuring projects, with the following additional specific instructions.

Batch job submission

Caf format output files appear in directory $OUTPUT_DIR/incoming (all skims).

Since caf production uses recursive input datasets, before submitting jobs using the manual method, make sure that no jobs for the specified skim are running.

Storing files in sam

Storing files does not involve submitting batch jobs. Rather, storing involves running a background process on a stager node.

Caffing crontab

Here is a typical crontab entry that invokes cron_submit.sh, cron_store.sh, and cron_store2.sh at 15 minute intervals.
8,23,38,53 * * * * cd /prj_root/5007/csg/csgprod/work/cafprod-p21.18.00-p20.16.08;./cron_submit.sh
3,18,33,48 * * * * cd /prj_root/5007/csg/csgprod/work/cafprod-p21.18.00-p20.16.08;./cron_store.sh
13,28,43,58 * * * * cd /prj_root/5007/csg/csgprod/work/cafprod-p21.18.00-p20.16.08;./cron_store2.sh

Fixing

Overview

Fixing involves the following two steps.
  1. Fixing proper. Batch jobs running on cab read all-stream files from sam and write a single tmb format output stream to disk.
  2. Storing in sam. Fixed files are stored in sam.
Fixing usually does not involve filtering of input events. Therefore, output files are the same size as input files and no merging step is needed.

Fixing executable and rcp

Here are some essential data about the fixing program.

Preparation

Follow general instructions for configuring projects, with the following additional specific instructions.

Fixing job submission

The single pass job submission method is always used for fixing. The reason for this is that the single pass method allows multiple sam projects to be running in parallel. Since fixing is more compute-intensive than either skimming or caffing, it is desirable to keep the maximum number of jobs running on cab at all times. Output fixed files appear in directories $OUTPUT_DIR/incoming.

Storing files in sam

Storing files does not involve submitting batch jobs. Rather, storing involves running a background process on a stager node.

Fixing crontab

Here is a typical crontab entry that invokes cron_submit.sh, cron_store.sh, and cron_store2.sh at 15 minute intervals.
6,21,36,51 * * * * cd /prj_root/5007/csg/csgprod/work/fixprod-p20.18.02b;./cron_submit.sh
12,27,42,57 * * * * cd /prj_root/5007/csg/csgprod/work/fixprod-p20.18.02b;./cron_store.sh
2,17,32,47 * * * * cd /prj_root/5007/csg/csgprod/work/fixprod-p20.18.02b;./cron_store2.sh

Monitoring production projects

Most of the time production jobs run without human intervention under the control of the various cron production scripts. The production scripts have some ability to recover from certain common failures and errors. Other kinds of failures require intervention by the user managing the production project, or by outside support (d0sam-admin@fnal.gov or service desk).

Common problems

Some common failures that can cause production to stall are listed below.

Checking project status

Here is a list of checks that you should make frequently (e.g. on a daily basis). Some of the checks make use of specially written scripts. Some simply involve using standard system commands.

Things to check after a downtime

Problems of all kinds are more common after downtimes. Here are some common problems that are especially likely after downtimes and remedies.

Auditing and error recovery

For any kind of production project, it will be necessary to verify that all files from the input dataset(s) have been processed and the results stored in sam. This must be done at least when a project is ending. It may be beneficial to do intermediate audits from time to time if a project has been running for a long time.

For any kind of production project, auditing should be done when the project is stopped, meaning the following.

The actual procedures for auditing projects are different for different kinds of projects.

Auditing skimming projects

Here is the procedure for auditing skimming projects.
  1. Check in configuration file defs that parameter PARENT_DATASET is properly set. This is the dataset that will be checked to make sure it has been fully processed. It is reasonable to set this to the full static input dataset ($DATASET) if the project is supposed to be finished. If recursive job submission is being used (but is temporarily stopped), you can set the parent dataset to $CONSUMED_DATASET.
  2. If this is the final audit for a d0reco version, check the farm processing page until all datasets for that d0reco version are marked FINISHED or COMPLETED (or learn from Mike Diesburg that the farm is finished processing a d0reco version).
  3. Wait for all skimming batch jobs to finish.
  4. Edit configuration file defs and set FULL_MERGE=1.
  5. Run check_reservations.sh to check if there are any file locks on unmerged files. If there are locked files, clear the locks using clear_reservations.sh.
  6. Wait until all unmerged files are finished merging (use command check_buffers.sh).
  7. Edit configuration file defs and set FULL_MERGE=0.
  8. Wait for all merged files to be stored in sam (use check_output.sh).
  9. Check whether any output files have parents that have been declared bad in sam using script find_bad.py. If this script reports any files with bad parents, have these files declared bad. Repeat until find_bad.py does not report any files with bad parents.
  10. Check whether any output files are "virtual" using script find_virtual.sh (a virtual file has metadata but no location). If any virtual files are found, delete the sam metadata using sam command "sam undeclare file". Repeat until find_virtual.sh doesn't report any virtual files.
  11. Run script check_all.sh. This script can take a long time (many hours) to run. It is best to save the output from check_all.sh in a log file.
    ./check_all.sh >& check_all.log
    
  12. If check_all.sh reports any duplicate files, declare the duplicate files bad (requires sam admin privilege, or ask d0sam-admin@fnal.gov). Repeat steps 11-12 until there are no reported duplicate files. It is rare, but not unheard of, for duplicate files to be generated.
  13. If check_all.sh reports any missing files, run the following script.
    ./define_makeup.sh all
    
    This script defines datasets for the missing files, and as a side effect makes files containing lists of missing files by skim, called <skim>.missing, and all.missing which is the union of missing files from all skims. Newly defined datasets for missing files are appended to configuration file Control. You can display the newly created datasets using the following script.
    ./check_control.sh
    
  14. If there are only a few missing files, you can submit them manually using the manual single-pass method.
    pcron   # Get ticket.
    ./runTMBStream.sh
    
    If there are too many missing files to submit all at once, you can let them be submitted automatically by cron using cron_submit.sh. Make sure cron_submit.sh invokes the single-pass job submission script runTMBStream.sh rather than the recursive job submission script runTMBStream_recursive.sh (edit cron_submit.sh if necessary).
  15. Wait until all recovery skimming jobs have been submitted (check_control.sh does not generate any output), and all submitted batch jobs have finished.
  16. Repeat steps 4-15 until there are no missing or duplicated files.
  17. If you edited cron_submit.sh, change it back to the way it was originally.
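
The bookkeeping behind the missing and duplicate checks in this procedure can be pictured with a small Python sketch. This is not the implementation of check_all.sh, whose internals are not described here. It assumes that, for one skim, you already have the list of parent (input) files and, for each stored output file, the list of its parents. All names are hypothetical.

```python
from collections import Counter


def audit_one_skim(parents, output_parents):
    """Classify parent files for a single skim.

    parents:        iterable of input file names in the parent dataset.
    output_parents: dict mapping each stored output file name to the
                    list of parent file names it was made from.

    Returns (missing, duplicated): parents that appear in no output
    file, and parents that appear in more than one output file.
    """
    counts = Counter()
    for file_parents in output_parents.values():
        # Count each parent at most once per output file.
        counts.update(set(file_parents))
    parent_list = sorted(set(parents))
    missing = [p for p in parent_list if counts[p] == 0]
    duplicated = [p for p in parent_list if counts[p] > 1]
    return missing, duplicated


# File names are made up for illustration.
missing, duplicated = audit_one_skim(
    ["in1", "in2", "in3"],
    {"merged1": ["in1", "in2"], "merged2": ["in2"]},
)
print(missing, duplicated)
```

In this picture, missing parents are the ones that end up in the makeup datasets, while duplicated parents point to output files that need to be declared bad.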

Auditing caffing projects

Here is the procedure for auditing caffing projects.
  1. Wait until recursive input datasets are empty. You can use the following script to check this.
    ./check_datasets_all.sh
    
    Or simply wait until no new jobs have been submitted by cron for a long time.
  2. Wait for any caffing batch jobs to be finished.
  3. Wait until all output files are stored in sam (use check_output.sh).
  4. Check whether any output files have parents that have been declared bad in sam using script find_bad.py. If this script reports any files with bad parents, have these files declared bad. Repeat until find_bad.py does not report any files with bad parents.
  5. Check whether any output files are "virtual" using script find_virtual.sh (a virtual file has metadata but no location). If any virtual files are found, delete the sam metadata using sam command "sam undeclare file". Repeat until find_virtual.sh doesn't report any virtual files.
  6. Run script check_skims.sh. This script can take a long time (many hours) to run. It is best to save the output from check_skims.sh in a log file.
    ./check_skims.sh >& check_skims.log
    
  7. If check_skims.sh reports any duplicate files, declare the duplicate files bad (requires sam admin privilege, or ask d0sam-admin@fnal.gov). Repeat steps 6-7 until there are no reported duplicate files. It is rare, but not unheard of, for duplicate files to be generated.
  8. If check_skims.sh reports any missing files, edit configuration file defs and uncomment the line that defines MAKEUP_DATASET. Make sure the definition of MAKEUP_DATASET is unique (has not been used before).
  9. Do the following commands for each skim that has any missing files.
    setenv SKIM <skim>   # Skim
    setenv JOBS <n>      # Number of jobs
    ./define_makeup.sh   # Define makeup dataset.
    ./runTMBAnalyze.sh   # Submit jobs.
    
    Choose the number of jobs depending on the number of missing files. It is usually good to set the number of jobs to be about half of the number of missing files, up to a maximum of 120.
  10. After you are done submitting makeup batch jobs, edit defs to comment out the definition of MAKEUP_DATASET again.
  11. Repeat steps 4-10 until there are no missing files.
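
The rule of thumb for choosing the number of makeup jobs (about half the number of missing files, capped at 120) can be written as a one-line calculation. The Python sketch below is illustrative; the function name is hypothetical, and rounding down with a minimum of one job is an assumption, since the rule only says "about half".

```python
def makeup_job_count(n_missing, max_jobs=120):
    """Suggest a value for JOBS given the number of missing files.

    Assumption: half the missing files, rounded down, but at least one
    job when anything is missing, and never more than max_jobs.
    """
    if n_missing <= 0:
        return 0
    return min(max_jobs, max(1, n_missing // 2))


for n_missing in (1, 7, 10, 500):
    print(n_missing, makeup_job_count(n_missing))
```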

Auditing fixing projects

Here is the procedure for auditing fixing projects.
  1. Check in configuration file defs that parameter PARENT_DATASET_NAME is properly set. This is the dataset that will be checked to make sure it has been fully processed. It is reasonable to set this to the full static input dataset ($INPUT_DATASET_NAME) if the project is supposed to be finished.
  2. If this is the final audit for a d0reco version, check the farm processing page until all datasets for that d0reco version are marked FINISHED or COMPLETED (or learn from Mike Diesburg that the farm is finished processing a d0reco version).
  3. Wait for all fixing batch jobs to finish.
  4. Wait for all fixed files to be stored in sam (use check_output.sh).
  5. Check whether any output files have parents that have been declared bad in sam using script find_bad.py. If this script reports any files with bad parents, have these files declared bad. Repeat until find_bad.py does not report any files with bad parents.
  6. Check whether any output files are "virtual" using script find_virtual.sh (a virtual file has metadata but no location). If any virtual files are found, delete the sam metadata using sam command "sam undeclare file". Repeat until find_virtual.sh doesn't report any virtual files.
  7. Run script check.sh. This script can take a long time (many hours) to run. It is best to save the output from check.sh in a log file.
    ./check.sh >& check.log
    
  8. If check.sh reports any duplicate files, declare the duplicate files bad (requires sam admin privilege, or ask d0sam-admin@fnal.gov). Repeat steps 7-8 until there are no reported duplicate files. It is rare, but not unheard of, for duplicate files to be generated.
  9. If check.sh reports any missing files, run the following script.
    ./define_makeup.sh all
    
    Newly defined datasets for missing files are appended to configuration file Control. You can display the newly created datasets using the following script.
    ./check_control.sh
    
  10. If there are only a few missing files, you can submit them manually using the manual single-pass method.
    pcron   # Get ticket.
    ./runTMBFixer.sh
    
    If there are too many missing files to submit all at once, you can let them be submitted automatically by cron using cron_submit.sh.
  11. Wait until all recovery fixing jobs have been submitted (check_control.sh does not generate any output), all submitted batch jobs have finished, and all output files are stored in sam.
  12. Repeat steps 5-11 until there are no missing or duplicated files.
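
The definition of a "virtual" file used in the audit procedures (metadata but no location) amounts to a simple filter. The Python sketch below illustrates the definition only; it is not the implementation of find_virtual.sh, it assumes the file locations have already been looked up, and all names are hypothetical.

```python
def find_virtual(file_locations):
    """Return the names of files that have metadata but no location.

    file_locations: dict mapping each file name known to the metadata
                    catalog to a list of its locations (possibly empty).
    """
    return sorted(name for name, locations in file_locations.items()
                  if not locations)


# File names and locations are made up for illustration.
catalog = {
    "fixed_0001": ["example-location-1"],
    "fixed_0002": [],
    "fixed_0003": ["example-location-1", "example-location-2"],
}
print(find_virtual(catalog))
```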

Comments to CSG Conveners

Sept. 9, 2010