Design manifesto and maintainer’s guide
JIM sandbox is a part of SAMGrid job management. It was originally intended to
be used on the Grid-Fabric boundary by the JIM job managers, which instantiate a
SAMGrid job at the execution site as a collection of local jobs (see the
document on the Grid to Fabric job submission interface). The purpose of the
sandbox component is to provide a viable abstraction for the collection of all
the files required by a user job, other than its input/output data. The
services provided are packing and unpacking of the files, as well as physical
management (movement) of file collections both within the local cluster and
through the Grid/Fabric gateway.
In
what follows, we describe the rationale behind the sandboxing development. We
then give some details of the design and implementation, which should be
sufficient to start delving into the inline documentation as necessary for
maintenance, and then notes on installation and configuration. We conclude with
package status and issues.
The
need for a sandbox abstraction and associated services has emerged as follows. In
a traditional (pre-Grid) computing model, users tend to make at least two
assumptions about the environment where they execute their jobs. These
assumptions, which are violently broken on a Grid, follow:
1. Standard software is installed cluster-wide, by means of e.g. an
   NFS-exported UPS products tree. The installed software includes the
   experiment software as well as numerous “infrastructure” packages such as
   the Python interpreter.
2. There is a durable (almost permanent), “no-cost” local storage, called the
   home area, where the jobs and agents (such as batch systems) acting on
   behalf thereof can safely deposit small files. These files include both
   those needed to bootstrap the job (input) and any logs produced (output).
The first assumption is now often stated explicitly, and the experiments are
lifting it by developing tools to envelop their applications and provide an
appropriate run-time environment (good examples are D0 RTE and CDF CAF). As we
will later explain, the user code and the rest of the run-time environment must
be augmented with additional infrastructure packages and then deployed
physically at the worker nodes (i.e. the computers where the actual data
processing takes place, usually in batch mode). In SAMGrid, this is
accomplished by JIM sandboxing.
What
is more, the second assumption is almost always implied and cast in stone. The
“home area” concept has a long history and is part of the broader concept of an
account, whereby computer access is
controlled statically. Grid computing in general (not JIM or SAMGrid in
particular) strives to provide a fuller and much more dynamic resource control
by virtue of sophisticated authorization frameworks. The Run II physics
experiments are adopting and driving these services. Thus, it becomes
increasingly necessary to be able to move job files in and out of the execution
node without going through the home area. JIM sandboxing obviates the home
area concept.
Incidentally,
both of the above assumptions involve usage of a shared file system such as
NFS. Many of our collaborators from D0 and CDF, as well as system
administrators, have repeatedly expressed dissatisfaction with reliance on a
shared file system. It is these people, with their rich experience of managing
jobs on large clusters, who shaped our cautious attitude towards shared file
systems. The most common issue is the performance bottleneck (because of the
centralized topology and UDP-based communication); low security (NFS
authentication is IP based) should also be mentioned. JIM sandboxing provides
complete independence from the shared file system.
In addition to the above assumptions, (SAM)Grid computing faces the well-known
issue of dramatic variation in the computing environment across the sites where
it is deployed. Development of a uniform job submission interface, which could be
used by standard Grid machinery such as that of Condor-G/JIM, was severely
complicated by this heterogeneity, especially when it came to the mechanisms of
job file transfer. Most systems relied on a shared file system; some used batch
systems with built-in file transfer mechanisms, etc. To make the task of the
Grid-Fabric job submission interfacing manageable, we (SAMGrid developers)
decided to develop a standalone component, which would be separate from the
actual job management, and which minimized
the dependence on the local site configuration. This independence could not
be complete because at least one executable and at least one output file per
job had to be transferred by locally configured mechanisms (e.g. by the batch
system). We reduced, however, the management of hundreds of job-related files
to understanding such a local configuration for very few files with subsequent
bootstrapping of the sandboxing, whereby the same software is used at all the
participating sites, however dissimilar.
Last
but not least, we observe that some of the files needed by the user job, at
least in the case of coordinated activities such as Monte-Carlo (MC)
production, are the same for many jobs. For example, the standard D0 and CDF
code releases are used multiple times for different MC requests, and efforts
began within the experiments (outside of JIM) to package (pieces of) releases
as tar-balls. At some point, it was simultaneously proposed by a number of
people to use a sophisticated data handling system such as SAM for retrieval of
such large (GB size) common files, and thereby leverage some of the powerful
data handling features. To name a few such features:
- central bookkeeping,
- dynamic (on demand) data replication, as opposed to manual installation of
  releases from repositories such as KITS,
- intelligent caching of common files with automatic reclamation of space from
  old, unused files (that are in the GB range),
- file transfer throttling and other (global) resource management,
- robustness through retrials and failover to alternative replicas,
- ease of interface with a Grid-level scheduler/resource broker.
Obviously, these features have a less profound effect for job files than for
the actual data; but being able to leverage an existing technology instead of
developing
new tools was a definite advantage. Note that we can restate the second point
above more strongly as independence from
the software pre-installed on sites, which in turn is part of the highly
desired independence of physics results from the identity of the site.
Thus,
a software component was conceived to provide the above services for the job
management. It was not designed and developed from scratch but grew within the
JIM job management suite and was eventually identified and cut out. As for the
term “sandboxing” itself, it was originally used in the security context, to
isolate user programs from their surroundings, i.e. the hosting execution
environment. We (and many others) use the same word in a different,
complementary sense. If you would like a comparison to the real-life sandbox,
we provide bagging services for bringing the toys in and carrying garbage out
of the play area, whereas the security context means understanding the
boundaries of the area and rules such as not throwing sand out. Obviously, both
aspects are needed.
To provide the services described above, we adopted the following strategies
while designing JIM sandboxing:
- Develop an easy way to gather all the files required by the job in a single
  logical container, called a sandbox. These files include both those specified
  by the user explicitly and those implied by the job, such as the X509 user
  proxy, configuration instances, file transfer clients, etc.
- Provide an easy mechanism to transfer this entire collection to the worker
  node of the execution site. “Ease” refers to the ability to insert execution
  of sandbox management code before and after the actual user job, and thus
  redefine the job as far as the local batch system is concerned, without
  dramatically complicating the definition of the wrapped job. As a
  counter-example, specifying in a submission command line a long list of files
  that must be pre-staged is unacceptably tedious and error-prone.
- Control the transfer of the sandbox constituents, whose number and/or size
  may be large, to multiple destinations within the cluster, i.e. to many
  worker nodes. Such control is desired for efficiency and reliability, to
  avoid hundreds of simultaneously starting jobs all retrieving their
  constituents at once. As another counter-example, implicit file retrieval
  (access) through a home-like shared disk area is completely unmanageable in
  the case of NFS.
- Provide a service to the user job for returning a “small” output back to the
  aforementioned logical collection. This output is separate from any data
  designated for the data handling system and includes log files, etc.
  Symmetrically to the input reading, this service should be transparent to the
  batch system (for ease of job submission configuration), efficient and
  controllable.
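The second and fourth strategies amount to wrapping the user job between two pieces of sandbox management code. The following is a minimal Python sketch of that idea, not the JIM implementation: the function name run_wrapped and the two hook arguments are hypothetical and introduced only for illustration.

```python
import subprocess


def run_wrapped(user_command, fetch_inputs, return_outputs):
    """Run a user command between sandbox setup and teardown hooks.

    fetch_inputs and return_outputs are hypothetical callables standing in
    for the sandbox management code; they are not part of any JIM API.
    """
    fetch_inputs()  # stage the sandbox constituents before the job starts
    try:
        # The batch system only ever sees this wrapper, not the file list.
        exit_code = subprocess.call(user_command)
    finally:
        # Return the small output whether the job succeeded or not.
        return_outputs()
    return exit_code
```

The point of the design is that the local batch system is handed a single wrapped command, so no list of pre-staged files has to appear in the submission command line.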
Thus, we start with a sandbox as a logical container. Although we map
sandboxes to (initially blank) disk directories, we strive to provide a level
of abstraction slightly above a disk directory and other operating system
concepts. The sandbox is physically created at the head-node of the cluster. We
then allow the user (i.e. the job management software) to populate the sandbox
with the necessary constituents (files), typically by creating symbolic links.
Of course, we check for errors while dereferencing these links. Once the
sandbox is finalized, the user requests a handle that can be used to
reconstitute the sandbox on another machine (e.g. a worker node). We accomplish
this by packing the sandbox, instantiating a sandbox replication service (to be
explained later) and returning, as the handle, a bootstrap executable.
Packaging includes the addition of files internal to JIM sandbox and the
creation of the first stage of the bootstrapping process, which is presented
next.
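Before turning to the bootstrap, the container lifecycle just described (create a blank directory, populate it with symbolic links, check the links, pack) can be sketched in Python as follows. This is an illustration under stated assumptions: the class SandboxSketch and its methods are hypothetical and are not the Sandbox class from sandbox.py, and the sketch assumes a POSIX system where symbolic links are available.

```python
import os
import tarfile
import tempfile


class SandboxSketch:
    """Illustrative stand-in for a sandbox; not the jim_sandbox Sandbox class."""

    def __init__(self, root):
        # A sandbox maps to an initially blank directory.
        self.path = tempfile.mkdtemp(prefix="sandbox_", dir=root)

    def add(self, source, name=None):
        """Populate the sandbox with a symbolic link to an existing file."""
        source = os.path.abspath(source)
        if not os.path.exists(source):
            raise FileNotFoundError(source)
        link = os.path.join(self.path, name or os.path.basename(source))
        os.symlink(source, link)
        return link

    def package(self, archive_path):
        """Pack the sandbox, following links so real file contents are stored."""
        for entry in os.listdir(self.path):
            full = os.path.join(self.path, entry)
            # os.path.exists() follows links, so a dangling link fails here.
            if not os.path.exists(full):
                raise FileNotFoundError("dangling link: " + full)
        # archive_path is expected to lie outside the sandbox directory.
        with tarfile.open(archive_path, "w:gz", dereference=True) as tar:
            tar.add(self.path, arcname=".")
        return archive_path
```

Storing link targets rather than the links themselves is what makes the archive usable on a machine that does not share the head-node's file system.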
The
term “bootstrap” means a sequence of construction/initialization stages whereby
each subsequent stage uses the machinery created in the previous stage. There
are three stages in the course of sandbox setup:

Figure 1. Bootstrapping stages in JIM Sandbox
Initially, only a bare minimum is assumed about the machine where the sandbox
replication takes place: nothing beyond the standard OS with “sh” and “tar”.
The bootstrap executable combines the control scripts with their input and is
therefore the one and only file that must be transferred by the batch system in
a way that must be configured locally. This binary contains in itself the files
required for the second stage of the bootstrap process. [1] These second-stage
files are the binaries, libraries and configuration files for the file transfer
mechanisms, as well as the script containing instructions to retrieve a list of
files from the physical location of the sandbox. When executing, the bootstrap
binary unpacks these files and passes control to the “sandbox manager” script,
or stage two. As a technical detail, our self-extractor is a compiled binary,
which we preferred over a “standard” ASCII UNIX facility such as “shar” because
the “uudecode” that the latter uses may be unavailable, and because the
compiled program is smaller and faster. Thus, we require the presence of a C
compiler at the time of preparation, and we assume that the packing occurs on
the same architecture as the unpacking.
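The idea of a single file that carries both the extractor and its payload can be illustrated with a short Python sketch. The actual self-extractor is a compiled C binary whose on-disk layout is not described here, so everything below is an assumption made for illustration: the marker SBX1, the trailer format and both function names are hypothetical.

```python
import io
import struct
import tarfile

MAGIC = b"SBX1"  # hypothetical marker, chosen for this sketch only
TRAILER = struct.Struct("<4sQ")  # marker followed by the payload length


def build_self_extractor(stub, files):
    """Return stub + tar payload + trailer as one blob.

    stub stands for the extractor code; files maps file names to bytes.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    payload = buffer.getvalue()
    return stub + payload + TRAILER.pack(MAGIC, len(payload))


def extract_payload(blob):
    """Recover the embedded files by reading the trailer at the end of the blob."""
    magic, length = TRAILER.unpack(blob[-TRAILER.size:])
    if magic != MAGIC:
        raise ValueError("no embedded payload found")
    start = len(blob) - TRAILER.size - length
    payload = blob[start:start + length]
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }
```

Placing the length at the very end lets the extractor find its payload without knowing its own size in advance, which is one common way to build such archives.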
Stage two invokes file transfers to fetch the user sandbox, i.e. the file
initially supplied by the SAMGrid user at the time of job submission, as well
as the “application”-specific files needed for stage three. Our stage three
transfers are done via the SAM data handling system, and therefore the files
retrieved in stage two are the data handling (SAM) clients: sam_user_api (or
its successors), sam_cp and their dependencies.
Stage
three is, strictly speaking, outside of JIM sandboxing per se. It retrieves SAM
datasets specified by the upper-level layers (JIM job managers and other
SAMGrid services) and passes control to the “user script” – the script supplied
by the end user at the time of SAMGrid job submission. In principle, this
script may be the base of a new, user-designed bootstrapping sequence involving
even more files.
The
core of the sandbox replication service is a built-in file transfer mechanism.
It was designed to be part of the sandboxing management in order to simplify
the configuration of SAMGrid execution sites. We chose a flavor of gridFTP,
which for our purposes is nothing but a common FTP client/server suite with specially
configured security mechanisms. Our choice of gridFTP is driven by the popularity
of both the underlying FTP mechanisms and the GSI standards, as well as by the ease
of derivation of the security context from that of the associated Grid job.
The
actual file transfers are authenticated with (a form of) the same X509
credential that was used to authenticate the Grid job at the execution site at
hand. This proxy credential is an important part of the job’s Grid context; it
is typically accessed through the X509_USER_PROXY environment variable. The
authorization file is also derived from that of the cluster’s gatekeeper and
restricts access to those users who were authorized to run jobs at this site in
the first place.
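As a small illustration of how a job-side script might find the proxy credential, the Python sketch below reads X509_USER_PROXY and otherwise falls back to /tmp/x509up_u<uid>, the conventional Globus default location. The function name and its arguments are hypothetical, and the sketch does not claim to reproduce how the sandbox manager itself locates the proxy.

```python
import os


def locate_user_proxy(environ=None, uid=None):
    """Return the path where the X509 proxy of the current job is expected.

    Prefers X509_USER_PROXY; otherwise falls back to /tmp/x509up_u<uid>,
    the conventional Globus default location.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("X509_USER_PROXY")
    if not path:
        uid = os.getuid() if uid is None else uid
        path = "/tmp/x509up_u%d" % uid
    return path
```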
Conceptually,
this service is instantiated dynamically for the duration of the Grid job (i.e.
for the lifetime of the associated local jobs in the batch system). In
practice, we prefer to deploy statically a server on the gateway node, called
jim_gridftp, that can be used for multiple jobs for multiple users. Dynamic
server starting/stopping is also supported by the jim_gridftp software package,
which provides additional isolation of individual Grid jobs from each other.
In
the second stage of the bootstrapping process, JIM sandbox sets an environment
variable, OUTPUT_FILE, that is propagated through the subsequent stages to all
the layers of the user job. The user job may gather files that it deems
important (these may include any core files, as long as they are not too big) and create e.g. a compressed
tar file pointed to by the variable. Upon completion of the user job
(successful or not), the second-stage sandbox manager uses the same file
transfer mechanism as the one in the sandbox replication service to transfer
the aggregated output back to the physical location of the sandbox (on the
head-node). Afterwards (and outside the scope of JIM sandbox), when the JIM job
managers terminate the Grid job and destroy the sandbox, such output files
from all the local jobs are aggregated further into what can be considered the
output of the Grid job as a whole. This final file is later pulled through
the gateway back to the Grid file spooling area and, ultimately, is retrieved
by the Grid job owner.
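From the user job's side, the OUTPUT_FILE contract can be honoured with a few lines of Python. The sketch below is only an example of what a user job might do: the function name, the 50 MB cap on core files and the rule that core files are recognised by a "core" name prefix are all assumptions made for illustration.

```python
import os
import tarfile

MAX_CORE_BYTES = 50 * 1024 * 1024  # hypothetical cap on core file size


def collect_output(paths, environ=None, max_core_bytes=MAX_CORE_BYTES):
    """Write the given files into the compressed tar named by OUTPUT_FILE."""
    environ = os.environ if environ is None else environ
    target = environ["OUTPUT_FILE"]
    packed = []
    with tarfile.open(target, "w:gz") as tar:
        for path in paths:
            if not os.path.isfile(path):
                continue  # skip anything the job did not actually produce
            name = os.path.basename(path)
            if name.startswith("core") and os.path.getsize(path) > max_core_bytes:
                continue  # core files are included only if not too big
            tar.add(path, arcname=name)
            packed.append(name)
    return packed
```

The job only writes the file; moving it back to the head-node remains the responsibility of the sandbox manager.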
The
implementation physically resides in the CDCVS package jim_sandbox. Its
src/python subdirectory contains the main sandbox.py file with the Sandbox
class definition; additional implementation files are found in the same directory.
Both the CLI and the Python API are provided for the functionalities described
in the Section on design: create(), enter(), add(), package(), destroy().
The
src/shell subdirectory contains the shell scripts for the packing of the
bootstrap. Other scripts are used to save and restore the configuration of the
(SAMGrid) packages in the form of the files which can easily be added into the
sandbox and processed by the described mechanisms. The src/c subdirectory
contains miscellaneous C routines, some of which probably belong in a more
generic “util” package. An important utility is the “sleeper”, which pauses
the current process until the start of the X509 proxy validity period. This
compensates for local clock discrepancies, as required by the gridFTP client.
The etc subdirectory of the package contains,
most importantly, the template for the second stage sandbox manager and a
“half-packed” tar file with the gridFTP client. More detailed information is
contained in the code itself.
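The logic of the “sleeper” is simple enough to sketch. The real utility is a C routine in src/c; the Python version below only illustrates the idea, and its name, its arguments and the optional safety margin are hypothetical. How the start of validity is read from the proxy is outside the sketch.

```python
import time


def sleep_until_valid(not_before, now=time.time, sleep=time.sleep, margin=0.0):
    """Pause until the proxy's validity period has begun on the local clock.

    not_before is the start of validity in seconds since the epoch.
    Returns the number of seconds waited.
    """
    delay = not_before + margin - now()
    if delay <= 0:
        return 0.0  # local clock is already past the start of validity
    sleep(delay)
    return delay
```

The clock and the sleep function are passed in as arguments so that the behaviour can be checked without actually waiting.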
The
package is presently distributed via FNAL KITS. It uses, as a dependency, the
jim_gridftp product and miscellaneous utilities common for many SAMGrid
packages. When installing the product, the only essential configuration
parameter is the disk location where sandboxes will be physically
created as OS directories. The size of this local storage is determined by the
product of the number of Grid jobs (O(10); the SAMGrid design intelligently
avoids proliferation of the number of Grid jobs by structuring them
appropriately) and the “typical” sandbox size (O(1GB)).
The product is deemed to be reasonably stable and we do not anticipate additional significant development in the near future. It has been thoroughly refurbished and stripped of most of the unnecessary and legacy code. Perhaps the only item that was planned but postponed, due to the lack of immediate need, was the stage-two throttling of the sandbox constituents’ transfers through a mechanism such as “fcp”. (We use such control widely at stage three for bigger files.)
Some of the issues remain at the design level, however.
Please send your suggestions or comments about this document to Igor Terekhov (terekhov@fnal.gov) and Gabriele Garzoglio (garzoglio@fnal.gov).
Last updated on Friday, August 27, 2004.
[1] We often refer to this executable as a self-extractor, inspired by the self-extracting archives used since the times of MS-DOS.