The Theory and Practice
This document presents the ideas, implementation, and installation instructions for the SAMGrid job submission (and related) services on the boundary between the Grid and the Fabric. Its primary audience is D0 and CDF collaborators, and possibly the administrators of their associated computing clusters, who are not necessarily intimately familiar with Grid computing but are knowledgeable (expert) in managing Run II jobs and/or computing clusters. In other words, our readers are “SAMGrid station administrators” who set up and maintain remote installations for Run II data processing (Monte-Carlo, reprocessing, and possibly analysis).
We observed that configuring the job submission interface was the most difficult step in setting up a SAMGrid execution site, and this document was originally conceived as part of the deployment guide. Both the JIM developers and the deployment participants, however, realized that, in addition to step-by-step instructions, we needed to clarify the purpose of, and the ideas behind, JIM’s Grid-to-Fabric job submission interface. Thus, this document strives to shed light on JIM’s approach to job management, both in theory and in practice, in order to facilitate further development of SAMGrid expertise at the sites. Our other, equally important goal is to invite open and constructive criticism of our technical design.
Generally speaking, SAMGrid provides several other services on the Grid-Fabric boundary, such as resource monitoring and advertisement; the present document focuses solely on job submission. We begin with a (possibly incomplete) introduction to the terminology. We then present the rationale for, and the design of, SAMGrid job submission at the Fabric, and then detail the actual installation and configuration of the relevant software components. We apologize if some readers find that too much detail is given to something apparently trivial, but every one of these questions was actually raised in the course of the deployment. We conclude with miscellaneous notes.
First, a clarification of terminology is in order. Fabric is a somewhat fancy word (which we didn’t invent) used to describe the collection of physical computing (as well as storage, etc.) facilities that comprise the Grid. We do not attempt to define the general term Grid Computing; for our purposes, it means automated, globally distributed computing on resources associated with SAMGrid. “Grid to fabric interface” means a small collection of services implemented on the boundary between the Grid and Fabric levels. Note that we use the term “service” in the software sense. Physically, these boundary services are carried out by servers and/or scripts running on the head-node (also referred to as the gateway node). As far as job submission is concerned, these services allow a Grid job, scheduled at the site by the Grid scheduler, to be instantiated at the execution site.
Physically, instantiating a job at the site means submitting multiple local jobs to the batch system, including preparing all the necessary non-Physics data as “input” and subsequently retrieving the small output (i.e. output files, such as logs, that are not destined for a full-fledged data handling system such as SAM). These non-Physics files are moved from/to the SAMGrid submission site, where the (Grid-level) jobs are spooled, and which in turn is typically close to the SAMGrid client site from which the user submits the Grid jobs.
As a member of the HEP computing community, the reader is certainly familiar with the application-imposed complexities in the instantiation and management of a real Run II physics job. Hundreds, often thousands, of small files must be supplied with the job, efficiently enough not to break the local file transfer mechanisms. The jobs must not interfere with each other, even when several of them are scheduled on a single node. The number of local jobs running in parallel must be chosen so as to maximize the probability of job completion (within the limits imposed by the batch system), yet not be so large that too many small jobs produce too many small output files.
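The trade-off in the last sentence can be made concrete with a minimal Python sketch of one possible splitting rule. It assumes, purely for illustration, that the Grid job is a Monte-Carlo request counted in events, that the CPU time per event and the batch queue’s wall-time limit are known, and that a hypothetical safety factor leaves headroom for slow nodes; the function name and all parameters are ours, not SAMGrid’s. The rule makes each local job as large as the wall-time limit safely allows, which keeps the number of jobs, and hence of small output files, low.

```python
import math


def plan_local_jobs(total_events, seconds_per_event, max_walltime_s, safety=0.5):
    """Split a Grid job into local jobs (hypothetical heuristic).

    Each local job gets at most as many events as fit into
    safety * max_walltime_s, so it is likely to finish before the batch
    system's wall-time limit, while the job count stays as small as possible.
    Returns a list with the number of events for each local job.
    """
    if total_events <= 0:
        return []
    budget_s = safety * max_walltime_s
    per_job = max(1, int(budget_s // seconds_per_event))
    n_jobs = math.ceil(total_events / per_job)
    # Spread events evenly instead of leaving one tiny trailing job.
    base, extra = divmod(total_events, n_jobs)
    return [base + 1 if i < extra else base for i in range(n_jobs)]
```

For example, plan_local_jobs(10000, 60, 86400) yields 14 local jobs of 714 or 715 events each; spreading the events evenly avoids a tiny trailing job that would produce one more nearly empty output file.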
From our perspective, we need to emphasize Fabric-imposed complexities in (SAM)Grid computing. In general, even if a site has been running real physics jobs locally for years, it may not be straightforward to add this site to SAMGrid (or any other serious Grid) for two principal reasons:
1) There seems to exist a practically unlimited variety of local configurations in terms of directory structures, shared file systems, conventions for naming standard output/diagnostic files, designated mechanisms for intra-cluster small file transfers, etc., none of which is managed by any “standard” Grid (or non-Grid) software. For example, your local users may be accustomed to having their standard output/diagnostics deposited in their “home area” upon job completion. This doesn’t quite work when the “grid” (i.e. foreign) user doesn’t have an NFS-shared home area. Even though new local accounts may be created for Grid users, it is highly desirable that Grid jobs not leave files behind.
2) Job submission by a machine such as Grid middleware imposes more stringent requirements on the site’s software services than job submission by a human scientist. For example, every batch system we have used thus far has, at one point or another, “forgotten” some of the jobs running in it; whereas a human would simply shrug it off and re-issue the query command minutes later, the Grid machinery would fail the job for no good reason. It is impossible to guarantee 100% accuracy, but it often becomes necessary to have wrappers that absorb, e.g., transient glitches in “lookup” commands. As another example, if even a single node in the cluster is malfunctioning, the batch system typically tends to send all the local jobs to the bad machine: because it fails (i.e. “completes”) jobs very quickly, its slots free up sooner and it appears to achieve a spuriously high turnaround. A human would easily modify his submission script to avoid this black hole effect, but this is not straightforward to automate.
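The black hole effect can at least be detected automatically. The following is a minimal Python sketch, assuming a hypothetical record of recently finished local jobs as (node, wall-time in seconds) pairs; the function name and thresholds are illustrative, not SAMGrid defaults. Nodes it flags could then be fed to the optional node-avoidance feature of the idealizers described below.

```python
from collections import defaultdict
from statistics import median


def find_black_holes(finished_jobs, fraction=0.1, min_jobs=3):
    """Flag worker nodes that 'complete' jobs suspiciously fast.

    finished_jobs: iterable of (node_name, wall_seconds) pairs for recently
    finished local jobs. A node is flagged when it ran at least `min_jobs`
    jobs and its median wall time is below `fraction` of the cluster-wide
    median. Thresholds are illustrative only.
    """
    jobs = list(finished_jobs)
    if not jobs:
        return set()
    overall = median(t for _, t in jobs)
    per_node = defaultdict(list)
    for node, t in jobs:
        per_node[node].append(t)
    return {
        node
        for node, times in per_node.items()
        if len(times) >= min_jobs and median(times) < fraction * overall
    }
```

The cluster-wide median is only a reasonable yardstick while healthy nodes have run most of the jobs; if a black hole has already absorbed the majority of them, comparing against an expected duration per job type is more robust.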
In this bi-directional education process with the site experts, we, for our part, want the experts to understand these issues (and those derived from them), which drove the design of the Grid to Fabric job submission interface. We now proceed to a description of the design.
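Here is the mini-architecture of the Grid-Fabric Interface job submission service suite:
[Figure: mini-architecture of the Grid-Fabric Interface job submission service suite; image not available]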
We now describe the role of each block in the diagram.
We adopted (but did not invent) the term “batch system idealizers”, which, despite its appearance, is intended to represent a serious concept. As the name implies, idealizers “idealize” the batch systems to make their interactions with the Grid machinery easier, by “mitigating” any imperfections and adding any “missing” features. Mitigation includes:
· retries in lookup commands for certain batch systems,
· generation of easy-to-parse output (batch system commands usually return output that is either too terse or too verbose),
· compensation for confusing exit statuses returned by batch system commands.
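Retries in lookup commands, the first mitigation above, can be sketched in Python as follows. This is a minimal illustration, not the code of any shipped idealizer: run_lookup is a hypothetical callable that executes the native lookup command and returns its exit status and output, and the attempt count and delay are arbitrary example values.

```python
import time


def lookup_with_retries(run_lookup, job_id, attempts=3, delay_s=30.0, sleep=time.sleep):
    """Query the batch system, absorbing transient lookup failures.

    run_lookup(job_id) is a hypothetical callable returning
    (exit_status, output_text) from the native lookup command. A nonzero
    exit status or an empty reply is treated as a possible transient
    glitch and retried; only after `attempts` consecutive failures is the
    job reported as unknown (None).
    """
    for attempt in range(attempts):
        status, output = run_lookup(job_id)
        if status == 0 and output.strip():
            return output.strip()
        if attempt < attempts - 1:
            sleep(delay_s)
    return None
```

Treating an empty reply as a failure is one way to compensate for batch systems that exit with status 0 while printing nothing; a job that has genuinely left the system is then reported as unknown only after every attempt, rather than after a single transient glitch.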
Added features include:
· grouping of jobs, for all batch systems, by an attribute such as a generalized “project” (i.e. in the SAMGrid rather than the SAM sense),
· local scratch management on the worker nodes, i.e. setup and cleanup of the scratch space before/after user job execution,
· (optional) explicit preference for (or avoidance of) nodes that are (or are not) well suited for the Grid job(s) in question.
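Grouping by project can be emulated even on a batch system that has no such notion. One possible approach, sketched below in Python under loudly stated assumptions, is for the idealizer to name every job “<project>.<n>” at submission time and to filter on that prefix at lookup time; the three-column listing format, the naming convention, and the function name are all hypothetical.

```python
def jobs_in_project(listing, project):
    """Return {job_id: status} for the jobs belonging to `project`.

    `listing` is assumed to hold one job per line, "<job_id> <job_name> <status>",
    where the idealizer named each job "<project>.<n>" when submitting it.
    Both the format and the naming convention are hypothetical.
    """
    prefix = project + "."
    result = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip headers, blank lines and anything unexpected
        job_id, name, status = fields
        if name.startswith(prefix):
            result[job_id] = status
    return result
```

Matching on “project.” rather than on the bare project name keeps a project called “test” from picking up the jobs of a project called “testing”.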
Scratch management deserves a special remark. Although one can argue in theory whether this service belongs in the batch system (PBS, for example, knows nothing about it), we find it extremely useful and strongly recommend that batch jobs run in their own separate scratch directories, for reasons of performance and mutual isolation. All the computing facilities we have seen have worker nodes with a few GB of scratch space, which is seldom, if ever, used. For PBS, we provide a “special” scratch setup script that wraps the user job, creating a subdirectory before the job and deleting it afterwards, in a location that must be configured (see below).
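The wrapper idea can be sketched in a few lines of Python. This is not the pbs_scratch_setup.sh script itself; it assumes the site-configured scratch root is passed in as an argument and the user job is given as an argument list, and the function name and the samgrid_job_ directory prefix are arbitrary choices.

```python
import os
import shutil
import subprocess
import tempfile


def run_in_scratch(command, scratch_root):
    """Run `command` (an argv list) in a fresh private scratch directory.

    The directory is created under `scratch_root`, the site-configured
    scratch area on the worker node, and is removed afterwards whether
    the job succeeds or fails. Returns the job's exit status.
    """
    if not os.path.isdir(scratch_root):
        raise FileNotFoundError(f"scratch area {scratch_root!r} does not exist")
    workdir = tempfile.mkdtemp(prefix="samgrid_job_", dir=scratch_root)
    try:
        return subprocess.call(command, cwd=workdir)
    finally:
        # Clean up even on failure so Grid jobs leave no files behind.
        shutil.rmtree(workdir, ignore_errors=True)
```

Using try/finally guarantees cleanup even when the user job exits with an error, and the per-job subdirectory keeps several jobs scheduled on the same node from interfering with each other.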
One idealizer is provided, as a template, for each batch system with which we are familiar. These templates are located within the sam_batch_adapter package. Site experts are expected to look inside them and modify them to accommodate the local configuration, e.g. by setting the full paths to the correct values.
Batch adapters, also located in the sam_batch_adapter package, were originally intended for use by the “sam submit” command, which in turn was intended to provide the correct interface for submitting SAM analysis jobs to a batch system and for performing actions such as starting/stopping a SAM project; see http://d0db.fnal.gov/sam_batch_adapter/sam_batch_adapter.html. In the expanded job management scheme provided by JIM and other SAMGrid tools, this package serves as the configuration tool for the job submission/lookup/kill commands implemented in the aforementioned idealizers; i.e. the adapters for jobs coming from the Grid must be configured to use the idealizer appropriate for your local batch system.
The difference between “adapters” and “idealizers” is that the former provide a uniform interface to the batch system, whereas the latter provide the scripts that actually, and correctly, implement this interface. For example, the adapter concept contains an interface for looking up a job in the batch system, and an idealizer actually performs the lookup, handles some of the errors, and returns a complex, multi-line string that is nevertheless easy to parse. In a broader sense, “adapters” include “idealizers”.
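The adapter/idealizer split maps naturally onto an interface and its implementation. The Python sketch below is only an analogy: the real sam_batch_adapter package is configured with command strings via sambatch rather than with Python classes, and the class and method names here are hypothetical. The in-memory FakeIdealizer stands in for a handler script so that the example runs without a batch system.

```python
from abc import ABC, abstractmethod


class BatchAdapter(ABC):
    """Uniform interface the job managers program against (illustrative)."""

    @abstractmethod
    def submit(self, project, executable, arguments, stdout, stderr):
        """Submit one local job and return its batch job ID."""

    @abstractmethod
    def lookup(self, project):
        """Return {job_id: status} for all jobs of `project`."""

    @abstractmethod
    def kill(self, project, job_id):
        """Remove one job; return True on success."""


class FakeIdealizer(BatchAdapter):
    """In-memory stand-in for a real idealizer such as a PBS handler script.

    A real idealizer would run the native batch commands here, retry
    transient lookup failures and normalize output and exit statuses.
    """

    def __init__(self):
        self._jobs = {}  # job_id -> (project, status)
        self._next_id = 1

    def submit(self, project, executable, arguments, stdout, stderr):
        job_id = str(self._next_id)
        self._next_id += 1
        self._jobs[job_id] = (project, "Q")
        return job_id

    def lookup(self, project):
        return {j: s for j, (p, s) in self._jobs.items() if p == project}

    def kill(self, project, job_id):
        entry = self._jobs.get(job_id)
        if entry is None or entry[0] != project:
            return False
        del self._jobs[job_id]
        return True
```

Code written against BatchAdapter never needs to know which batch system sits underneath, which mirrors how the job managers only consult the configured adapter commands.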
The sandboxing service is provided within the jim_sandbox software package and is documented therein, so we merely provide a brief summary. “Sandboxing” in SAMGrid refers to the ability to transfer and initialize all the relevant input files for the user job, as well as to correctly collect and return the “small output”, thus avoiding the poorly conceived and controversial “home area” concept for a Grid user on a cluster which he/she does not own. It also makes job results reproducible by providing independence from any pre-installed experiment software. For input sandboxing, a staged bootstrapping process is used, whereby each stage uses the results of the previous one; the last and most advanced stage is the retrieval of a small input dataset through the SAM data handling system.
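The staged bootstrapping can be pictured as a pipeline in which each stage sees only what the earlier stages produced. The Python sketch below is a hypothetical illustration, not jim_sandbox code; the bootstrap function, the stage names, and the paths in EXAMPLE_STAGES are invented.

```python
def bootstrap(stages, context=None):
    """Run sandbox stages in order, each building on the previous results.

    `stages` is a list of (name, function) pairs. Every function receives
    the dict accumulated so far and returns a dict of new entries, so a
    later stage can only rely on what earlier stages produced. A failure
    is re-raised with the name of the stage where bootstrapping stopped.
    """
    context = dict(context or {})
    for name, stage in stages:
        try:
            context.update(stage(context))
        except Exception as exc:
            raise RuntimeError(f"bootstrap stage {name!r} failed") from exc
    return context


# Hypothetical stages: unpack the self-extracting sandbox, set up the
# tools it contains, then fetch input using those tools.
EXAMPLE_STAGES = [
    ("unpack", lambda ctx: {"sandbox_dir": "/scratch/job_0"}),
    ("tools", lambda ctx: {"tools_dir": ctx["sandbox_dir"] + "/tools"}),
    ("input", lambda ctx: {"input_dir": ctx["sandbox_dir"] + "/input",
                           "fetched_with": ctx["tools_dir"] + "/fetch"}),
]
```

Naming the failing stage in the error matters on the Grid, where the job log is often the only evidence of how far bootstrapping got.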
The SAMGrid job managers implement the service of Grid job instantiation at the execution site by mapping a logical Grid job definition (with details provided by, e.g., the SAM Monte-Carlo request system) to a set of local jobs submitted to the batch system. Our job managers come with the jim_job_managers software package and are installed into the Globus job manager area. When activated, they receive the job request via the standard GRAM protocol and create, submit, look up, and kill the multiple local jobs comprising the Grid job. In addition, they allow for XMLDB-based monitoring of Grid jobs, which is at the heart of JIM Grid job monitoring.
When receiving a Grid job request for the first time, the SAMGrid job managers calculate the number of local jobs to be submitted and then prepare these local jobs using the sandboxing mechanism. File name stems are chosen for both standard output and diagnostics; each local job writes to its own files because a unique suffix is appended to the stems. The result of the local job preparation is a self-extracting executable, i.e. an executable containing the other files needed for the bootstrapping process. The job managers then consult the (properly configured) batch adapters to learn which command to use for job submission, and execute that command the appropriate number of times.
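The last two steps, choosing unique output file names and running the configured submit command once per local job, can be sketched in Python as follows. The placeholder names (%__USER_PROJECT__, %__USER_SCRIPT__, %__USER_SCRIPT_ARGS__, %__USER_JOB_OUTPUT__, %__USER_JOB_ERROR__) are the ones used in the batch adapter configuration shown later in this document; the template itself, the “.N” suffix convention, the shell quoting, and the function names are our assumptions, and actually executing the commands and recording the returned job IDs is left out.

```python
import shlex

# Hypothetical template modeled on the batch adapter's placeholder names;
# at a real site the template comes from the sambatch configuration.
SUBMIT_TEMPLATE = (
    "./sam_pbs_handler.sh job_submit --project=%__USER_PROJECT__ "
    "--executable=%__USER_SCRIPT__ --arguments=%__USER_SCRIPT_ARGS__ "
    "--stdout=%__USER_JOB_OUTPUT__ --stderr=%__USER_JOB_ERROR__"
)


def expand(template, values):
    """Replace every %__NAME__ placeholder with a shell-quoted value."""
    for name, value in values.items():
        template = template.replace(f"%__{name}__", shlex.quote(str(value)))
    return template


def submit_commands(n_jobs, project, script, args, out_stem, err_stem,
                    template=SUBMIT_TEMPLATE):
    """Return one submit command line per local job.

    A numeric suffix is appended to the stdout/stderr file name stems so
    that every local job writes to its own files.
    """
    return [
        expand(template, {
            "USER_PROJECT": project,
            "USER_SCRIPT": script,
            "USER_SCRIPT_ARGS": args,
            "USER_JOB_OUTPUT": f"{out_stem}.{i}",
            "USER_JOB_ERROR": f"{err_stem}.{i}",
        })
        for i in range(n_jobs)
    ]
```

Quoting each substituted value keeps arguments containing spaces (such as “-l -a”) intact as a single option value when the command line is handed to a shell.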
When the job is executing on the worker node, the sandbox bootstrapping proceeds and eventually passes control to the job wrapper, which the job managers on the head-node (i.e. the local submission machine) provide for each known job type (D0 and Monte-Carlo, file merging, general SAM analysis, etc.). These wrappers perform functions such as initializing the experiment’s software “release tree” in the local scratch space, and finally pass control to the job script (which is either supplied by the Grid user, as in the case of CDF MC, or is standard for the given job type, such as the mc_runjob-based script for D0 MC).
As we said earlier, the very first step in deploying a SAMGrid execution site is to configure local job submission. Do not try to proceed with the JIM installation until you have understood and resolved all the issues in this document and are able to submit jobs as described below.
Install (but do not yet configure) the sam_batch_adapter package from KITS, then list the idealizer templates it provides:
~/> setup sam_batch_adapter
~/> ls -l $SAM_BATCH_ADAPTER_HANDLER_DIR
Choose the one named like sam_XXX_handler.sh, where XXX corresponds to your batch system. Copy this file to a “JIM area” (rename it if you like) and edit it to ensure that all the paths are set correctly. For PBS users: the PBS idealizer consists of two files, the other one being the scratch manager, pbs_scratch_setup.sh. You also need to copy this file to the “JIM area”; there is no need to modify anything in it, but be sure that the sam_pbs_handler.sh script contains the correct path to the scratch manager. You also need to choose a scratch disk path (on the worker nodes) and specify it correctly in the handler script.
Now go back to the “JIM area”, or the sub-area thereof from which the local jobs will be submitted (this path will later be given to the configurator script for the jim_sandbox package), and test the idealizer. Be sure to do so as the same user to which the Grid jobs are mapped. We can’t possibly list all the reasons why, even though it works for username X (a local user), it may not work for user Y (whichever account you designate for Grid jobs, typically “samgrid”)! Execute commands like these:
/big_disk/samgrid/jim/> cp /bin/ls ./binary
/big_disk/samgrid/jim/> ./sam_pbs_handler.sh job_submit --project=test --executable=$PWD/binary --stdout=$PWD/out --stderr=$PWD/err --arguments="-laF"
This, obviously, should result in a job submission. A second or so later, try:
/big_disk/samgrid/jim/> ./sam_pbs_handler.sh job_lookup --project=test
This command must return information about the job you have just submitted.
A few minutes later, when the job has finished, there must be a file named “out” in this directory! If there was a problem, check whether the “err” file contains any message. If not, check whether the batch system has sent you an email (obviously, this is not good enough for the Grid, but it is a means to troubleshoot local job submission). The most common problems are PBS-specific (more on these in a later section).
If you see the file “out” and it contains (aside from any debug messages) the listing of an almost blank directory on the worker node, congratulations! We are almost done!
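The manual submit-and-lookup check above can also be scripted, which is convenient when repeating it as the Grid-mapped account. This Python sketch assumes the handler accepts the options exactly as in the commands above, that job_submit prints the batch job ID on its last output line, and that the job ID appears verbatim in the job_lookup output; smoke_test and handler_argv are hypothetical helpers. Checking for the “out” file a few minutes later is still up to you.

```python
import subprocess
import time


def handler_argv(handler, action, **options):
    """Build e.g. ['./sam_pbs_handler.sh', 'job_lookup', '--project=test']."""
    return [handler, action] + [f"--{key}={value}" for key, value in options.items()]


def smoke_test(handler, workdir, run=subprocess.run, sleep=time.sleep, delay_s=2.0):
    """Submit `binary -laF` through the idealizer, then look it up.

    Returns True if the job ID printed by job_submit (assumed to be the
    last line of its output) appears in the job_lookup output.
    """
    submit = run(
        handler_argv(handler, "job_submit", project="test",
                     executable=f"{workdir}/binary",
                     stdout=f"{workdir}/out", stderr=f"{workdir}/err",
                     arguments="-laF"),
        capture_output=True, text=True, cwd=workdir)
    if submit.returncode != 0:
        raise RuntimeError(f"job_submit failed: {submit.stderr.strip()}")
    lines = submit.stdout.strip().splitlines()
    job_id = lines[-1].strip() if lines else ""
    sleep(delay_s)  # give the batch system a moment, as in the manual test
    lookup = run(handler_argv(handler, "job_lookup", project="test"),
                 capture_output=True, text=True, cwd=workdir)
    return lookup.returncode == 0 and bool(job_id) and job_id in lookup.stdout
```

For example, smoke_test("./sam_pbs_handler.sh", "/big_disk/samgrid/jim") after copying /bin/ls to ./binary as shown above. Passing the arguments as a list avoids any shell quoting of “-laF”.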
Now tell the batch adapter that your jobs are to be handled through the newly hacked script above. Use the “sambatch” command, available after setting up the sam_batch_adapter product. This command is extremely intuitive and straightforward, with solid run-time documentation (additional documentation is available at http://d0db.fnal.gov/sam_batch_adapter/sam_batch_adapter.html). Be sure to specify the submit, lookup, and kill commands (all implemented as the corresponding actions in the handler script) to the batch adapter. Your configuration should look similar to the following (taken from a station whose batch system is FBSNG, hence the sam_fbsng_handler.py script):
sam@samgfarm2:~> sambatch display --station=samgfarm
Station: samgfarm
  Default Adapter: grid
  Available Adapters: ['grid']
  Adapter: grid
    Available Commands: ['job kill command', 'job lookup command', 'job submit command']
    Command: /bin/sh -c '. /local/ups/etc/setups.sh;setup fbsng;${SAM_BATCH_ADAPTER_HANDLER_DIR}/sam_fbsng_handler.py job_kill --project=%__USER_PROJECT__ --localJobId=%__BATCH_JOB_ID__'
      Type: job kill command
      Known Outcomes:
        Exit Status: 0
        Outcome Description: Success
        Exit Status: 1
        Outcome Description: Failure
    Command: /bin/sh -c '. /local/ups/etc/setups.sh;setup fbsng;${SAM_BATCH_ADAPTER_HANDLER_DIR}/sam_fbsng_handler.py job_lookup --project=%__USER_PROJECT__ --localJobId=%__BATCH_JOB_ID__'
      Type: job lookup command
      Known Outcomes:
        Exit Status: 0
        Outcome Description: Success
        Exit Status: 0
        Expected Output: JobId=%__BATCH_JOB_ID__ Status=%__BATCH_JOB_STATUS__
        Exit Status: 1
        Outcome Description: Failure
    Command: /bin/sh -c '. /local/ups/etc/setups.sh;setup fbsng;${SAM_BATCH_ADAPTER_HANDLER_DIR}/sam_fbsng_handler.py job_submit --project=%__USER_PROJECT__ --executable=%__USER_SCRIPT__ --arguments=%__USER_SCRIPT_ARGS__ --stdout=%__USER_JOB_OUTPUT__ --stderr=%__USER_JOB_ERROR__'
      Type: job submit command
      Known Outcomes:
        Exit Status: 0
        Outcome Description: Success
        Exit Status: 0
        Expected Output: %__BATCH_JOB_ID__
        Exit Status: 1
        Outcome Description: Failure
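The “Expected Output” templates in this configuration tell the batch adapter how to pick the job ID and status out of a command’s output. A minimal Python sketch of that kind of matching, assuming each %__NAME__ placeholder stands for one whitespace-free token (our assumption; the adapter’s actual parsing rules are not described here, and the function names are hypothetical), could look like this:

```python
import re

PLACEHOLDER = re.compile(r"%__([A-Z_]+?)__")
LOOKUP_TEMPLATE = "JobId=%__BATCH_JOB_ID__ Status=%__BATCH_JOB_STATUS__"


def template_to_regex(template):
    """Turn an 'Expected Output' template into a regex with named groups."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text
        parts.append(f"(?P<{m.group(1)}>\\S+)")  # one token per placeholder
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("^" + "".join(parts) + "$")


def parse_lookup(output, template=LOOKUP_TEMPLATE):
    """Return {job_id: status} from the lines that match the template."""
    pattern = template_to_regex(template)
    jobs = {}
    for line in output.splitlines():
        m = pattern.match(line.strip())
        if m:
            jobs[m.group("BATCH_JOB_ID")] = m.group("BATCH_JOB_STATUS")
    return jobs
```

Lines that do not match the template, such as debug messages printed by a handler, are simply ignored, which is exactly why the idealizers are asked to produce output in a fixed, easy-to-parse form.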
Next, execute the test script found in the package, $SAM_BATCH_ADAPTER_DIR/etc/testBatchAdapters.py. It tests both the current and the previous step of the installation. If your jobs execute too quickly or too slowly for this test script, feel free to hack it as needed. If the script reports that the job has executed and its standard output was correctly received, the local job submission configuration is successful! You can now return to the main SAMGrid manual to complete your installation.
If this is not successful, we will be happy to help explain the expected behavior, but the problems need to be debugged with the help of your system administrators, who are more knowledgeable about local issues such as rcp/NFS, local account setup, batch system bugs, permissions on your local directories, etc.
Valeria Bartsch of CDF succeeded (with the experts’ help) in configuring Grid to fabric job submission for her PBS cluster before this document was released. She has documented her useful experience at http://home.fnal.gov/~bartsch/sam_batch_adapter_docu/. Thanks to Valeria; we hope that this link is still valid by the time you try to follow it!
We are at this time soliciting contributions from our D0 collaborators as well.
If you are using PBS:
Please send your suggestions or comments about this document to Igor Terekhov and Gabriele Garzoglio.
Last updated on Monday, August 30, 2004.