D0 MC Dial-A-Job Help
Dial-A-Job (DAJ) provides a simple graphical user interface for D0 Monte Carlo JIM job submission to SAMGrid. With DAJ a user may create Job Description Files (JDFs) for the three supported job types: dzero_monte_carlo, dzero_merge, and structured. The JDFs can be saved and submitted to SAMGrid. The user completes a form whose values are checked for validity. To facilitate JDF creation, a SAM portal is provided to query SAM for available jobfiles datasets and display their contents. The portal also allows creation of jobfiles datasets from the available files in SAM for D0 releases, d0runjob versions, and cardfiles. An MC request can be reserved for submission to SAMGrid by invoking the Queue.py script from DAJ. Please see the JIM Job Submission For Monte Carlo Requests page for an introduction to JIM job submission and the SAMGrid Manual for an in-depth discussion of SAMGrid.
DAJ also provides one-click submission of production and merging jobs, and recovery job submission that asks only for the request ID. The simplest way to start a production job is to use the Ignition window, where the user selects a site to run production of an MC request and then activates a button to start the request. When daj_daemon.py is running, the Ignition window (select site, activate Go) is the only feature the user need use to process a request. When daj_daemon.py is not running, the Ignition window and the Recover entry of the Jobs menu are the only features the user need use to process a request.
DAJ is written in Python and should run on versions 2.3 or later. DAJ uses the Tk interface to Python and hence depends on the Tk toolkit being installed.
DAJ operates in two modes, normal and remote. Normal mode is the default and is meant to be used from a machine where sam and jim_client can be set up. If the machine on which DAJ is running cannot set up sam and jim_client, remote mode may be used. Remote mode is invoked by passing the argument 'remote' on the command line to daj.py. Remote mode requires password-less ssh entry to an account on a machine where sam and jim_client can be set up. The script remote_daj must be placed on the remote machine. The location of the remote_daj script is specified in the dajrc initialization file. Remote mode is discussed further below.
In normal mode DAJ will refuse to start until the environment is correct. The user must set up sam and jim_client and obtain a valid grid proxy before the DAJ main window appears. The grid proxy may be obtained via a DOEGrids certificate or from a valid Kerberos ticket for the FNAL.GOV realm. A grid certificate is preferred because it allows long proxy lifetimes to be set.
The DAJ main window has a message area and menubar. From the File menu the user can choose to create a JDF for a supported job type or load an existing JDF. From the Tools menu the user may open a SAM portal window or invoke the Queue.py script to reserve a request to submit to the grid.
Upon choosing a JDF type from the Create submenu, a JDF form window appears. The user fills in the job attribute fields and may then act on the form through the buttons in the window. From this window JDFs can be saved to disk and submitted to SAMGrid. Before being saved or submitted the JDF undergoes a validity test. See JDF Window for more detail. The SAM portal window allows the querying of SAM for available jobfiles datasets and displays their contents. The portal will also create jobfiles datasets from the files available in SAM for D0 release, d0runjob, and cardfiles, which are available from drop down menus. The portal is designed to facilitate completion of the jobfiles_dataset entry box in the JDF window. See SAM Portal for more detail.
DAJ provides access to every function from the keyboard as well as the pointing device.
DAJ can also provide one click submission of production and merge
jobs. By supplying certain job configuration settings in the dajrc
initialization file an activation of the File/Auto/Monte_Carlo menu
item will reserve the next request by running Queue.py, get the
request information from SAM, create a JDF from this info and the
settings info provided by the user in the initialization file, and
submit the job. The submission of a job creates a record in a
persistent database that is used when the File/Auto/Merge menu item is
activated. Activating this item prompts the user for a request
ID, then extracts production job information for this request from the
database and constructs a merge job JDF, which is then submitted. For
more details see Auto below.
Command Line Options
The command line arguments available are remote,
test, fast, and help.
DAJ supports an optional initialization file, dajrc, to specify job attribute defaults and user preferences. See the supplied dajrc.template file as an example. The customizations available are listed below. Only some JDF attributes are reasonable to specify in the initialization file, for instance minbias_dataset and notify_user. The file dajrc is searched for in this order: first using the value of the DAJDRC environment variable, if defined, then in the working directory. The initialization file ignores blank lines and lines beginning with '#'. The significant lines are key value pairs separated by an equals sign. The attribute keys are their names. The preference keys with defaults in parentheses are:
See the file dajrc.template for an example. Note that the remote_daj
value must end with the following characters between the double
quotes: " ''"
(i.e. space-single quote-single quote).
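The file format described above (blank lines and '#' lines ignored, key value pairs separated by an equals sign) can be illustrated with a minimal Python sketch. This is not DAJ's own parser: the function name parse_dajrc is hypothetical, splitting on the first equals sign only and skipping lines without one are assumptions, and the sample values are placeholders.

```python
def parse_dajrc(text):
    """Parse dajrc-style text into a dict of key/value strings.

    Hypothetical helper for illustration; DAJ's real parser may differ.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and lines beginning with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        # Assumption: a line without '=' is skipped rather than reported.
        if "=" not in line:
            continue
        # Assumption: only the first '=' separates key from value,
        # so a value may itself contain '=' characters.
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


sample = """
# job attribute defaults (placeholder values)
notify_user = someone@example.org

runjob_version = 06-05-10_v2
"""
print(parse_dajrc(sample))
```

Values are kept as strings exactly as written, including any quote characters, so a quoted value such as the one required for remote_daj would be stored with its quotes intact.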
Remote Mode
Remote mode allows DAJ to function on a machine unable to setup sam or jim_client. Remote mode is invoked by passing the argument 'remote' on the command line to daj.py. Remote mode requires password-less ssh entry to an account on a machine where sam and jim_client can be setup. The script remote_daj must be placed on the remote machine. The location of the remote_daj script is specified in the dajrc initialization file. In remote mode no environment check is made before DAJ starts. Commands are executed on the remote machine via ssh invocation of the remote_daj script. In the case of JDF submission to SAMGrid, the JDF file is first copied to the directory on the remote machine where the remote_daj script resides according to the value of the remote_daj variable specified in the initialization file. Then a check for a valid grid proxy is made on the remote machine. If one isn't found the user is prompted for the Grid pass phrase of the user's grid identity or the Kerberos password for a ticket in the FNAL.GOV realm so that a proxy can be obtained. A command is then issued to the remote machine to submit the JDF file.
The remote_daj script needs to be tailored to the system on which it resides. There are configuration variables in the script that can be modified to accomplish the tailoring. These variables, their defaults, and their meaning are listed below.
For the 'Get Request' and 'Request Audit' features to be functional in remote mode the files Queue.py and request_audit.py (both included with DAJ) must be placed on the remote machine in the same directory. The location of that directory is specified in the initialization file.
Depending on connectivity and load, remote commands may be executed
much more slowly than in normal mode, so patience on the part of the
user is required.
OSG & LCG Job Submission
Job submission to OSG and LCG sites can be done by specifying
osg-ouhep as the station_name for the OSG
or ccin2p3-grid1 as the station_name for
the LCG, and supplying a grid_resource_requirement_string
value. These can be specified via the Create
entries of the Jobs Menu, or via the Ignition window. Job submission to OSG requires
valid grid credentials in the Fermi myproxy database. See Get Credentials below for more
information. OSG job submission requires that the Fermi product vdt
v1_3_2_3 be installed though not declared current on the local
daj.py machine or at the remote_daj
site. Note this is not the version of vdt used for the JIM products.
Resource Pool Job Submission
Jobs may also be sent to an external broker for disposition through the use of resource pools. The only external broker service presently implemented is the ReSS for OSG. The ReSS pool is specified as the grid_resource_requirement_string value in the JDF. The value is constructed by specifying each computing element as the value of the GlueCEInfoContactString keyword and logically or'ing these together with this syntax:
grid_resource_requirement_string = (GlueCEInfoContactString == CE1) || (GlueCEInfoContactString == CE2) [... || (GlueCEInfoContactString == CEn)]
Where CEi is the i'th computing element in the resource
pool which is specified as a standard OSG CE. The station name
is given as the OSG station.
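As a rough illustration of the syntax line above, the following Python sketch joins one GlueCEInfoContactString clause per computing element with `||`. The function name is hypothetical, and the sketch assumes each CE string is inserted verbatim, as in the syntax line; whether quoting is needed is not stated here.

```python
def build_requirement_string(contact_strings):
    """Return a grid_resource_requirement_string value for a list of CEs.

    Hypothetical helper; each CE is inserted verbatim (an assumption).
    """
    if not contact_strings:
        raise ValueError("a resource pool needs at least one computing element")
    clauses = ["(GlueCEInfoContactString == %s)" % ce for ce in contact_strings]
    # The clauses are combined with '||', matching the syntax shown above.
    return " || ".join(clauses)


print(build_requirement_string(["CE1", "CE2", "CE3"]))
```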
The grid_resource_requirement_string easily becomes cumbersome with just a few sites in the resource pool. To address this the site resource parameter may also be the name of a pool resource object. The resource pool name may be used as the value of the station_name attribute in the JDF form, and is displayed as a job site in the Ignition window. In this case no grid_resource_requirement_string is specified. The pool resource object is defined in the daj_pooldef file. For a resource pool the identifier name in the file is of the form: name@pooltype;ce1,ce2,...,cen. Where cei has the form gate.keeper.address:port/jobmanager-type e.g.
ress1@resspool;grid1.oscer.ou.edu:2119/jobmanager-lsf,osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor 2
Each cei must be a known resource defined in
sites.py and be appropriate to the type of pool.
Only resource pool type resspool for OSG sites is implemented. In
pool definitions, a line is continued if its last character
is a backslash (\) e.g.
ress1@resspool;grid1.oscer.ou.edu:2119/jobmanager-lsf,\
osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor 2
The daj_pooldef file is read at program startup and
may be edited and reread by invoking the 'Read Pool Defs' item in the
File menu. The defined resource pools may be inspected by invoking
the 'Show Pool Defs' item in the Settings menu. See the file
daj_pooldefs.template for an example.
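The pool definition form name@pooltype;ce1,ce2,...,cen and the backslash continuation rule can be sketched in Python as follows. This is not the code DAJ uses to read daj_pooldef: both function names are hypothetical, the host names are placeholders, and the sketch handles only the documented form without checking the CEs against sites.py.

```python
def join_continuations(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    merged = []
    pending = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            # Drop the backslash and keep accumulating.
            pending += line[:-1]
        else:
            merged.append(pending + line)
            pending = ""
    if pending:
        # Assumption: a trailing continuation at end of file is kept as-is.
        merged.append(pending)
    return merged


def parse_pool_definition(entry):
    """Split 'name@pooltype;ce1,ce2,...' into (name, pooltype, [ce, ...])."""
    head, _, ce_part = entry.partition(";")
    name, _, pooltype = head.partition("@")
    ces = [ce.strip() for ce in ce_part.split(",") if ce.strip()]
    return name.strip(), pooltype.strip(), ces


raw = [
    "ress1@resspool;ce-a.example.org:2119/jobmanager-lsf,\\",
    "ce-b.example.org:2119/jobmanager-condor",
]
for entry in join_continuations(raw):
    print(parse_pool_definition(entry))
```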
The DAJ main window has a message area and menubar. Messages are reported here from the main window, JDF editor, and SAM portal.
Available menus are File, Tools, Settings, and Help. The Help menu
has entries related to all functions of DAJ. Choosing an entry opens
the topic on the help page in a web browser. Links to the MonteCarlo
Production page, SAM Dataset Definition Query page, and SAMGrid
Monitoring pages are also provided. The Settings menu allows
customization of user settings. The menu items are discussed below.
File Menu
The user may also specify a request to start in the entry box of the Ignition window. If a request is specified an initial production job for the request is constructed and submitted. This will bypass getting the next request via Queue.py and therefore will not change the request status. This should only be used when appropriate and is not the usual way to process a request. If no request is specified in the Ignition window the next request from Queue.py is started. When daj_daemon.py is running use of the Ignition window (select site, activate Go) is the only feature the user need use to process a request. When daj_daemon.py is not running the Ignition window and the Recover entry of the Jobs menu are the only features the user need use to process a request.
The thread will construct by default a jobfiles_dataset name
according to the template:
sg_<d0 release version>_d0r<d0runjob version>_cf<cardfile version>
For example,
sg_p20.08.02-v3_d0r07-07-02_cfv01-00-10. The cardfile version and d0
release are obtained from the request info. The d0runjob version
may be specified by the user in the dajrc file with keyword
runjob_version. The request info only gives the release version
but not the tarball version in SAM to use.
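The default name template can be written as a one-line Python function. The function name is hypothetical, and the way the example name is split into its three parts (release p20.08.02-v3, d0runjob 07-07-02, cardfile v01-00-10) is an interpretation of the example above rather than something the text states explicitly.

```python
def default_jobfiles_dataset(d0_release, d0runjob_version, cardfile_version):
    """Build a name of the form sg_<release>_d0r<d0runjob>_cf<cardfile>.

    Hypothetical helper illustrating the template only.
    """
    return "sg_%s_d0r%s_cf%s" % (d0_release, d0runjob_version, cardfile_version)


print(default_jobfiles_dataset("p20.08.02-v3", "07-07-02", "v01-00-10"))
```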
To specify a particular
tarball version the d0rel_tarball_version keyword may be specified in
the dajrc init file. The value of the d0rel_tarball_version
keyword may be a white space separated list of different d0
release tarballs. The runjob_version keyword likewise specifies
a tarball version (not a list), for example:
runjob_version = 06-05-10_v2
d0rel_tarball_version = p17.09.06_v2 p17.09.01_v9
Note that normally there is no need to
specify the d0rel_tarball_version and runjob_version keywords because
these values are obtained from a central location.
When a jobfiles_dataset has to be constructed, these
assumptions are made:
release tarballs begin with 'd0_MC_'; d0runjob tarballs begin
with 'd0runjob_'; cardfile tarballs begin with 'cardFile_';
version numbers follow these prefixes, and '.tar.gz' follows the
version.
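The naming assumptions listed above (a prefix, then the version, then '.tar.gz') can be sketched as a small Python classifier. The function and constant names are hypothetical, and the release file name in the usage lines is constructed from the stated prefix rule for illustration, not taken from SAM.

```python
# Prefixes as stated in the assumptions above; the labels are illustrative.
PREFIXES = {
    "d0_MC_": "d0 release",
    "d0runjob_": "d0runjob",
    "cardFile_": "cardfile",
}
SUFFIX = ".tar.gz"


def classify_tarball(file_name):
    """Return (kind, version) for a recognized tarball name, else None."""
    if not file_name.endswith(SUFFIX):
        return None
    stem = file_name[:-len(SUFFIX)]
    for prefix, kind in PREFIXES.items():
        if stem.startswith(prefix):
            # Whatever follows the prefix is taken to be the version.
            return kind, stem[len(prefix):]
    return None


print(classify_tarball("d0runjob_v07-05-07.tar.gz"))
print(classify_tarball("d0_MC_p17.09.06_v2.tar.gz"))
```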
Information about the job submission is
written to a persistent
database named jobsdb whose location is determined by the
jobsdb_dir keyword in the dajrc init file and defaults to "./".
The database consists of keyword value pairs: the keyword is the
request ID in string form and the value is a list of tuples. Each
tuple corresponds to a grid job for the request ID. The data are
used to construct the merge JDF (see below) as well as for
bookkeeping. The script jobsdb.py included with DAJ will dump the
contents of the jobsdb to stdout.
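The shape of the jobsdb data (string request ID mapped to a list of tuples, one per grid job) can be pictured with the Python sketch below. A plain dict stands in for the persistent database, whose storage format is not described here, and the tuple fields shown are purely hypothetical placeholders.

```python
def record_job(jobsdb, request_id, job_info):
    """Append one grid-job tuple under the request ID in string form."""
    key = str(request_id)
    jobsdb.setdefault(key, []).append(tuple(job_info))


def jobs_for_request(jobsdb, request_id):
    """Return a copy of the list of job tuples for a request (may be empty)."""
    return list(jobsdb.get(str(request_id), []))


# A dict is used only for illustration; the tuple contents are invented.
db = {}
record_job(db, 99999, ("job-1", "site-a"))
record_job(db, 99999, ("job-2", "site-b"))
print(jobs_for_request(db, 99999))
```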
The JDF editor window presents the user with a form to complete the
attributes of a JDF, and buttons to act upon it. The attributes are
color coded. Red attributes must be completed, pink ones are optional depending on
user preferences and the type of job, and yellow ones should preferably be left
unchanged. Values initially displayed are defaults. Defaults are
built-in but may be superseded by the values in an initialization
file. The file name of the saved and submitted JDF is specified by
the user. The station_name attribute presents a menubutton that
lists the execution sites known to DAJ for the appropriate job_type,
i.e. runjob sites for monte_carlo, merge, and structured
jobs, and other sites for other job types. New sites can be
accommodated by typing the site into the entry box. See below for
details on the button actions, job attributes, and file name.
Buttons
The buttons act on the JDF either created or loaded, except the help button, which has entries related to the JDF editor window. The chosen topic is opened on the help page in a web browser.
check_consistency
= <Boolean value>
This attribute controls the level of consistency checks that are made
during the grid job submission. The default behavior is that of true
(all checks are made). A value of false causes some checks
(e.g. the d0 code version check) to be skipped. Mandatory checks
(e.g. if input is from SAM) are still performed.
d0_release_version =
<d0 code version>
The version of d0 code that is to be used for producing events for
runjob_requestid. The d0 code version should be consistent with the
version specified in the jobfiles_dataset.
events_per_file =
<number of events per output file>
This attribute states the number of events that are to be produced per
output file (or phase). For example, if events_per_file=250 then a Grid job of
25,000 events will generate 100 files (for each Monte Carlo phase)
containing 250 events in each file. If unspecified, the number of
events per output file will depend on the execution site at which the
grid job executes.
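The arithmetic in the example above can be checked with a short Python sketch. The function name is hypothetical, and rounding up when the event count is not an exact multiple of events_per_file is an assumption; the text only covers the evenly divisible case.

```python
def output_files_per_phase(total_events, events_per_file):
    """Number of output files per Monte Carlo phase for a grid job."""
    if events_per_file <= 0:
        raise ValueError("events_per_file must be positive")
    # Ceiling division: a partially filled last file still counts
    # (assumption for totals that do not divide evenly).
    return -(-total_events // events_per_file)


print(output_files_per_phase(25000, 250))  # the 25,000-event example above
```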
grid_resource_requirement_string =
<grid_resource_requirement_string>
Gatekeeper and jobmanager of OSG or LCG resource,
e.g. red.unl.edu:2119/jobmanager-pbs. Only needed and allowed
for OSG or LCG job submission. OSG submission requires osg-ouhep as
the station_name. LCG submission requires
ccin2p3-grid1 as the station_name.
jobfiles_dataset =
<dataset (snapshot) containing the tar balls>
The jobfiles_dataset is the dataset (snapshot) containing the files
that are necessary for executing the request or doing the
merging. This dataset typically contains but is not limited to, d0
code tree (e.g. d0_p14.03.02.tar.gz), card files (e.g.
cardFile_v00-07-00.tar.gz) and d0runjob code tree (e.g.
d0runjob_v07-05-07.tar.gz). Card files are not required to execute
merging jobs. If they are present in the dataset, they will not affect
the outcome of merging jobs.
merge id xor
dataset = <request number to merge or dataset of files
to merge>
Monte Carlo request number of thumbnail files to be merged or a
dataset name of thumbnail files to be merged. They are mutually exclusive.
minbias_dataset =
<dataset containing minimum bias events to be overlaid>
The files containing minimum bias events that are to be overlaid
in the digitization phase are specified in this dataset.
monte_carlo_efficiency
= <minimum success rate of a montecarlo job>
The minimum success rate of a montecarlo job required before
starting a merging job. For example, if monte_carlo_efficiency = 90, then
a merging job will be submitted only if the montecarlo job has produced
at least 90% of the events requested; otherwise the montecarlo job
is repeated for the remaining number of events.
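The decision described for monte_carlo_efficiency can be sketched in Python as below. This is an illustration of the rule as worded, not SAMGrid's implementation: the function name and the returned labels "merge" and "retry" are hypothetical, and treating exactly the threshold percentage as sufficient follows the "at least" wording.

```python
def next_step(events_requested, events_produced, monte_carlo_efficiency):
    """Decide between merging and repeating a montecarlo job.

    Returns ("merge", 0) or ("retry", remaining_events).
    """
    if events_requested <= 0:
        raise ValueError("events_requested must be positive")
    achieved = 100.0 * events_produced / events_requested
    if achieved >= monte_carlo_efficiency:
        return ("merge", 0)
    # Below threshold: repeat for the events still missing.
    return ("retry", events_requested - events_produced)


print(next_step(25000, 22500, 90))
print(next_step(25000, 20000, 90))
```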
notify_user = <user email address>
Email address at which the user will be notified when job completes.
notification = <Always,Never,Complete,Error>
monte_carlo_retries = <number of times a montecarlo job is retried>
The number of times a montecarlo job is retried to produce the
monte_carlo_efficiency before giving up.
phase_dataset = <dataset containing the input for a phase in the Monte Carlo chain>
If the request takes the input for a particular phase (typically it's
the generation phase) from SAM, then the dataset containing the input
is specified through this attribute. During submission consistency
checks are made to determine if the dataset specified by the
phase_dataset attribute matches the dataset
specified in the request details.
phase_dataset_intervals = <comma separated list of event intervals>
The phase_dataset_intervals are the intervals of events you want to
process for recovery from the phase_dataset. Relevant only for phase
dataset requests and is mutually exclusive with
runjob_numevts.
Example: phase_dataset_intervals = 1-250,501-1000,1251-2000
runjob_requestid = <monte carlo request number>
The request number which has its details present in the request
database. For more information please see
http://www-d0.fnal.gov/computing/mcprod/mcc.html
runjob_numevts = <Number of events to produce for the Request Id>
The number of events to be produced for the Request Id
(runjob_requestid). Mutually exclusive with
phase_dataset_intervals.
station_name = <stationname>
The station name at which the job will be executed, assuming that the
requirements are satisfied. If user does not define the station name,
brokering will determine it from a matching station, which is
essentially a random choice. However, station name may be declared if
user prefers a certain station. In DAJ jobs may be sent to a defined
resource pool by specifying its name here.
File Name
The file_name entry specifies the name of the JDF saved to disk. It
initially displays a default entry. The default for a loaded JDF is
the loaded file name. Changing the file name and saving will clone the
JDF. A save or submit action will overwrite an existing file without
warning. For newly created JDF's the default is a template with the
job_type as the prefix and .ssb as the extension. By default the file
is created in the tmp_dir directory specified in the dajrc file. The default has the marker '%i' in it
which is replaced when saving with the requestid number, resulting in
file names for example of the form dzero_monte_carlo_99999.ssb.
SAM Portal
The SAM portal window allows the querying of SAM for available
jobfiles datasets and displays their contents. The portal will also
create jobfiles datasets from the files available in SAM for D0
release, d0runjob, and cardfiles which are displayed in drop down
menus. The portal is designed to facilitate completion of the
jobfiles_dataset entry box in the JDF window. The widget functions
are described below. There is a status area in the lower button
frame that displays messages in addition to those in the main
window. The portal can also be used to store in SAM cardfile,
d0runjob, and D0 release tarballs to facilitate jobfiles dataset
creation.
Buttons
The buttons Go, ReScan, New, Store, Close, Help perform various
actions. The menu buttons 'd0 release', 'd0runjob', and 'cardfile'
present menus of available files in SAM for jobfiles dataset creation.
store_tmp. The directory is then tar'ed and
gzip'ed into a tarball which is then stored in SAM. For the D0
release, the tarball must be pre-made, and the path to it given. The
stored tarballs may then be used in a jobfiles dataset.
Dataset List
The results of the search initiated by the Go button are displayed
in this list box. Double clicking on a selection in the listbox will
initiate a SAM query to display the contents of the dataset. Only
dataset contents that are known to be in SAM and that
are of the four types found in the dataset files menubuttons' lists are displayed.
Dataset Files
The four entries display the contents of a dataset that has been
activated by a double click in the dataset listbox. Files are only
displayed if they are of the proper type and known to DAJ. The files
of the proper type known by DAJ are displayed in the menus activated
by the blue menubuttons. These files are determined at program startup
by several threads that query SAM for the desired information. A new
dataset may be created by selecting from the dataset files menus,
specifying a dataset name in the entry box, and clicking the New button.
Dataset Name
Name of dataset to be created by activating the New button.
$Revision: 1.26 $
Joel Snow
Created July 22, 2004
Revised June 18, 2009