Design manifesto and developer’s guide
JIM,
or SAMGrid, job managers are the main part of SAMGrid job
management on the Grid-Fabric boundary. JIM job managers provide a framework for
instantiating SAMGrid job at the execution site as a collection of local jobs.
Developed as part of the SAMGrid project, they manage the complexities of HEP
jobs, most notably, “production” D0 and CDF jobs such as Monte-Carlo and
reprocessing. Our job managers conform to the standard Globus GRAM protocol and
rely on two more components: SAM batch adapters/idealizers and JIM sandboxing.
Note
that the component presented herein is on the Fabric side of the overall SAMGrid job management developed within
JIM; we therefore caution the reader against misunderstanding the scope of the
component. The Grid-side (or
Grid-level) job management in JIM is based on the Condor-G technology as
documented elsewhere. The two levels work organically together as described in
SAMGrid papers and summarized later in the document.
The
primary purpose of this document is to disclose our design ideas and shed
sufficient light on the implementation so that further development is facilitated.
Our intended audience is therefore comprised by SAMGrid developers as well D0
and CDF collaborators who wish to extend the SAMGrid job management suite with
new job types (of course, extensions of the design framework are encouraged as
well). This document is not a replacement for in-line documentation and tries
to avoid excessive detailing which may become de-synchronized with the code
itself, i.e. our focus is on the ideas rather than the code.
The
document is organized as follows. First, we sketch the whole picture of the job
management in SAMGrid. We then proceed to the principal features of the job
management at the Fabric boundary, i.e. our package. Next, we give some details
of the physical design and implementation. We conclude with the status of this
SAMGrid component.
We
summarize JIM job managers as follows (references to detailed papers are found
under the Papers section at http://www-d0.fnal.gov/computing/grid/). At the high (Grid) level, user jobs are
described logically as requests; for example, a job of type D0 Monte-Carlo has MC Request ID, specification of the D0
release version, data input (including minimum bias mix-in) as SAM dataset(s),
any other control parameters and, lastly, the size of the job such as the total number of events desired. The job
is presented to the Grid scheduler through extremely thin user interface:

Figure 1.
SAMGrid job management from the Grid level
The
scheduler (queuing system) communicates with the Request Broker to determine
the Grid site for the job to run. In initial implementation, jobs are not split
across multiple sites. The Broker is embodied as the Condor-G Matchmaking
Service (MMS), which is a unique SAMGrid design feature. The Broker utilizes
information collected from the participating sites through the native Condor-G
advertisement framework. (The MMS and the Collector, conceived in the original
Condor system, were introduced at the Condor-G level through the D0-Condor
collaboration under PPDG). Once advised by the Broker, the Scheduler inserts
(pushes) the job into the chosen site through the Gatekeeper using the standard
GRAM protocol implemented in Condor-G. The job is received by the JIM job
managers described herein.
We
emphasize the hierarchical structure of the job. A single Grid job is mapped
onto many local (i.e. materialized in the batch system of the site) jobs. In
our opinion, this provides a clear, hierarchical view of the jobs where the
Grid-level job management deals only with high-level jobs, easily understood by
user scientists, and detailed decomposition of the job into runnable (in the
batch system) tasks is left for the Fabric-resident services. This “divide and
conquer” paradigm therefore facilitates job management and scales well with the workload increase.
Strictly speaking, our job structure is such that a Grid job is mapped (decomposed) onto one or more cluster jobs, each cluster job being scheduled at one site; it is the cluster job that is decomposed into a collection of local jobs. As of the time of wring this document, however, Grid job corresponds to only one cluster job, and in the remainder of the document we use cluster job and Grid job interchangeably.
To
complete the big picture, the Fabric-side JIM job managers perform the rest of
the SAMGrid job management; we now proceed to the details.
Once
the JIM job manager (JJM) receives the job specification, it dispatches the job
to the handler appropriate for this job type (specified by the user at the time
of Grid job submission). The main purpose of the handlers is to instantiate the Grid job as a collection
of local jobs. We first describe the most important part – job submission, we
then touch on lookup/kill, and conclude with notes on how the overall flow of
control in JJM’s is driven.
First,
checks for errors are done. Note that, since SAMGrid does not rely on any experiment-specific
software preinstalled at the site, there is considerably less room for errors,
for example, there does not exist a possibility of user specifying a D0 release
version that is not installed at the site.
Second,
the number of the jobs to be submitted is determined, based on:
·
the size of the
Grid job (specified as part of the request or explicitly in the Grid job
definition file) in terms of Physics units such as events
·
the
user-supplied CPU per event, if any,
·
the locally
configured capabilities of the site, in terms of recommended, or maximum, CPU
per local job.
The
last two quantities are harder to define in practice; in their absence, “rules
of thumb” are used, e.g. the typical number of D0 MC events per job is 250.
Additional considerations include whether it is desirable/possible to handle
multiple output files in each phase of the job etc.
Third,
the job’s starting command line with arguments is determined. For example, in
D0 MC the command is “mc_runjob” with one of the arguments being the SAM
request ID.
Fourth,
the job is physically packaged and prepared for submission, using the JIM
sandbox service, whereby the job driver is wrapped and packaged together with
other files such as SAM clients.
Fifth,
the actual job submission command is read from the SAM batch adapters
“database” and the command is actually executed the appropriate number of
times, computed in the second step. The local job ID’s are extracted from the
output.
Sixth,
monitoring is initiated by creating an entry in the local JIM XML database and
populating it with the associated local jobs.
JIM
Monitoring of the local jobs at the Grid-Fabric boundary has two aspects. One
validates the existence of the constituent local jobs in the batch system (BS),
and thus keeps the status of the Grid job as “active” as far as Grid scheduling
machinery is concerned. It is very
important that JJM machinery be able to validate the job’s presence.
Otherwise the Grid machinery will prematurely initiate job shutdown, or,
conversely, it may indefinitely “think” that the job is still running at the
cluster. Thus, all the “glitches” of the batch system and/or the head-node must
be carefully absorbed through retrials and other robustness mechanisms in the
lookup commands. In part, these retrials are implemented in BS-specific way
through BS idealizers (i.e. wrappers
around batch system commands that make their reports more precise). These
idealizers were added to the SAM batch adapters as part of preparing the batch
system tools for addition of upper software layers. In part, retrials work at
the level above the idealizers in the JJM’s themselves. We also use the term status aggregation to represent the
return of a simple GRAM status after querying for multiple local jobs. Most of
the time, the aggregated status is “active” or “done”; “unsubmitted” and
“error” are also used.
The
other aspect is primarily intended for human consumption to provide detailed
information about the job states. Unlike in the first aspect, we gather and
analyze detailed information about each individual local job, including their
completion times, using XML databases local to sites. What is more, further
detail specific to the job type at hand can be easily added to the same
database by virtue of extensibility of XML data structures. For example, the states
of individual Monte-Carlo phases are published in our database. (In case of D0,
it is possible to display that e.g. the d0gstar phase has completed 117 out of
250 events). Some of this specific information is populated by means external
to JJM’s, including the D0 MC driver, mc_runjob, or by other JIM
infra-structure such as sandboxing wrappers.
The
SAM batch adapters are used again and must therefore have been configured
properly for JJM’s to function.
Similarly
to submission and monitoring, we need to provide the service of stopping (more
or less reliably) all the local jobs that constitute the Grid job. The only
non-trivial part of job termination is the gathering of all the standard
output/diagnostic/other files, deposited by the jobs back into their sandbox. JJM’s make up a file name stem
for these output files, with unique suffix such as the local job ID, appended
by (for) each job. These files are combined into a tar file, log files of the
JJM’s themselves are added, and the resultant single output file is returned through
GRAM protocol back to the Grid (i.e. to the Grid job spool area at the submission site).
The
overall flow of control is driven by the Globus job managers, as we aimed at
maximizing the use of Globus toolkit and minimizing the development of thick
high-level (application-specific or VO-specific) services. Here we give more
detailed definition of the job managers and describe the flow of control.
When
the initial GRAM (Grid job) request is received, the Globus gatekeeper spawns
the Globus job manager process which
uses the job manager scripts for the
purposes of actually managing the jobs. SAMGrid job managers are actually these
scripts driven by the external process (job manager per se as defined in the
Globus terminology). This process calls the submission script, then it
repeatedly calls the lookup script, and eventually the termination script,
after which it exits itself.
To
conform to this protocol our scripts have to return certain information in a
simple format into the standard output stream, and they are required to
complete their operation within a short time allotted to them (30 seconds in
the earlier versions of Globus, with more freedom added later). Since the batch
system response time is uncontrollable, it was technically difficult to ensure
the timeliness of completion. In addition, debugging was complicated as no
other messages could be output, and any accidental output would greatly confuse
the job manager process. As an implementation detail, we overcame these
difficulties by launching scripts in the background and carefully avoiding
duplicate processes.
Naturally,
control flow would be simplified if we had developed the driver process,
similar to the ALIEN environment.
The following figure depicts the internal structure of our package.

Figure 2.The internal structure of the JIM job manager package
As we
have described earlier, the Globus job manager is outside our package. The Perl
to Shell adapter were written when the Globus job management bundle changed
from Shell to Perl, thus incapacitating the scripts that we had developed with
an earlier Globus release. The actual shell scripts provide Globus protocol
conformance, including the aforementioned compensation for the batch system
response time. These pass control to the main module, run_grid_job.py, which actually implements all the functionality
described in the section on principal features.
The
main implementation module uses a (semi-) abstract job manager adapter.
Handlers of particular job types (such as D0 Monte-Carlo) are implemented in
separate files. Some of the methods in this adapters are:
·
getNumSubmissionTimes()
·
prepareWorkArea()
·
getCommandAndArguments()
We
refrain from complete listing which might quickly become obsolete and refer the
reader to the actual code (see below).
The
bulk of the code is common for all the job types with concrete JM adapters
being quite thin. To extend the package with a handler for a new job type (e.g.
CDF file concatenation), one, obviously, needs to provide a subclass of the JM
adapter class and then register the new (Python) file as a “known module” in
the main implementation file.
The
implementation physically resides in the CDCVS package jim_job_managers. Its
src/python subdirectory contains the main, common run_grid_job.py file along
with concrete “plug-ins” for the particular job types (e.g.
run_grid_job_cdfmc.py). Other directories of interest are src/shell and
src/perl.
The
package is presently distributed via FNAL KITS. It uses, as a dependency, the
jim_sandbox and sam_batch_adapters products and miscellaneous utilities common
for many SAMGrid packages. When the package is distributed, the code is mostly
moved from the “src” to the “libexec” directory. When SAMGrid job managers are installed,
the package registers itself as the Globus job manager of the type
“jobmanager-samgrid”. It is then advertised to SAMGrid by the jim_advertise
package.
The product’s core framework is deemed to be reasonably stable and we do not anticipate additional significant development in the near future. We anticipate that more RUN II job types will be adapted and added, hence we hope that this document has accomplished the goal of introducing the ideas to the future developers.
Please send your suggestions (or) comments about this document to Igor Terekhov (terekhov@fnal.gov) and Gabriele Garzoglio(garzoglio@fnal.gov).
Last updated on Monday, August 30, 2004.