D0 MC Dial-A-Job Help


Overview

Dial-A-Job (DAJ) provides a simple graphical user interface for D0 Monte Carlo JIM job submission to SAMGrid. With DAJ a user may create Job Description Files (JDFs) for the three supported job types: dzero_monte_carlo, dzero_merge, and structured. JDFs can be saved and submitted to SAMGrid. The user completes a form whose values are checked for validity. To facilitate JDF creation, a SAM portal is provided to query SAM for available jobfiles datasets and display their contents. The portal also allows creation of jobfiles datasets from the files available in SAM for D0 releases, d0runjob versions, and cardfiles. An MC request can be reserved for submission to SAMGrid by invoking the Queue.py script from DAJ. Please see the JIM Job Submission For Monte Carlo Requests page for an introduction to JIM job submission and the SAMGrid Manual for an in-depth discussion of SAMGrid.

DAJ also provides one-click submission of production and merging jobs, and recovery job submission that asks only for the request ID. The simplest way to start a production job is the Ignition window, where the user selects a site on which to run production of an MC request and then activates a button to start the request. When daj_daemon.py is running, the Ignition window (select site, activate Go) is the only feature the user needs to process a request. When daj_daemon.py is not running, the Ignition window and the Recover entry of the Jobs menu are the only features the user needs to process a request.

DAJ is written in Python and should run on versions 2.3 or later. DAJ uses the Tk interface to Python and hence depends on the Tk toolkit being installed.


Usage

DAJ operates in two modes, normal and remote. Normal mode is the default and is meant to be used from a machine where sam and jim_client can be set up. If the machine from which DAJ is running cannot set up sam and jim_client, remote mode may be used. Remote mode is invoked by passing the argument 'remote' on the command line to daj.py. Remote mode requires password-less ssh entry to an account on a machine where sam and jim_client can be set up. The script remote_daj must be placed on the remote machine. The location of the remote_daj script is specified in the dajrc initialization file. Remote mode is discussed further below.

In normal mode DAJ will refuse to start until the environment is correct. The user must set up sam and jim_client and obtain a valid grid proxy before the DAJ main window appears. The grid proxy may be obtained via a DOEGrids certificate or from a valid Kerberos ticket for the FNAL.GOV realm. A grid certificate is preferred because it allows long proxy lifetimes to be set.

The DAJ main window has a message area and menubar. From the File menu the user can choose to create a JDF for a supported job type or load an existing JDF. From the Tools menu the user may open a SAM portal window or invoke the Queue.py script to reserve a request to submit to the grid.

Upon choosing a JDF type from the Create submenu a JDF form window appears. The user fills in the job attribute fields and may then act on the form through the buttons in the window. From this window JDFs can be saved to disk and submitted to SAMGrid. Before being saved or submitted the JDF undergoes a validity test. See JDF Window for more detail. The SAM portal window allows the querying of SAM for available jobfiles datasets and displays their contents. The portal will also create jobfiles datasets from the files available in SAM for D0 release, d0runjob, and cardfiles, which are available from drop-down menus. The portal is designed to facilitate completion of the jobfiles_dataset entry box in the JDF window. See SAM Portal for more detail.

DAJ provides access to every function from the keyboard as well as the pointing device.

DAJ can also provide one-click submission of production and merge jobs. When certain job configuration settings are supplied in the dajrc initialization file, activating the File/Auto/Monte_Carlo menu item will reserve the next request by running Queue.py, get the request information from SAM, create a JDF from this information and the settings provided by the user in the initialization file, and submit the job. The submission of a job creates a record in a persistent database that is used when the File/Auto/Merge menu item is activated. Activating this item queries the user for a request ID, then extracts production job information for this request from the database and constructs a merge job JDF, which is then submitted. For more details see Auto below.

Command Line Options

The command line arguments available are remote, test, fast, and help.

Initialization

DAJ supports an optional initialization file, dajrc, to specify job attribute defaults and user preferences. See the supplied dajrc.template file as an example. The customizations available are listed below. Only some JDF attributes are reasonable to specify in the initialization file for instance minbias_dataset and notify_user. The file dajrc is searched for in the order, value of DAJDRC environment variable if defined then in the working directory. The initialization file ignores blank lines and lines beginning with '#'. The significant lines are key value pairs separated by an equals sign. The attribute keys are their names. The preference keys with defaults in parentheses are:

The value of the d0rel_tarball_version keyword may be a white space separated list of different d0 release tarballs. The last three items in the list above are used for one click monte_carlo job submission. See Auto below for more details. Note that normally there is no need to specify the d0rel_tarball_version and runjob_version keywords because these values are obtained from a central location on startup.

See the file dajrc.template for an example. Note that the remote_daj value must end with the following characters between the double quotes: " ''" (i.e. space-single quote-single quote).
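The file format described above (blank lines and '#' lines ignored, key value pairs separated by an equals sign) can be illustrated with a short reader. This is a minimal sketch in current Python 3, not DAJ's actual code; the function name parse_dajrc, the sample contents, and the handling of lines without '=' are assumptions made for the illustration.

```python
def parse_dajrc(text):
    """Parse dajrc-style text into a dict of key -> value strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines beginning with '#' are ignored.
        # Assumption: leading white space before '#' is tolerated.
        if not line or line.startswith("#"):
            continue
        # Assumption: split on the first '=' only, so a value may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # assumption: lines without '=' are skipped silently
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical dajrc contents, using keys mentioned in this help page.
example = """
# job attribute defaults
notify_user = someone@example.org

runjob_version = 06-05-10_v2
d0rel_tarball_version = p17.09.06_v2 p17.09.01_v9
"""

settings = parse_dajrc(example)
# A white space separated list value can be split after parsing.
tarballs = settings["d0rel_tarball_version"].split()
```

Keeping every value as a string and splitting list values afterwards keeps the reader independent of which keys are attributes and which are preferences.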

Remote Mode

Remote mode allows DAJ to function on a machine unable to set up sam or jim_client. Remote mode is invoked by passing the argument 'remote' on the command line to daj.py. Remote mode requires password-less ssh entry to an account on a machine where sam and jim_client can be set up. The script remote_daj must be placed on the remote machine. The location of the remote_daj script is specified in the dajrc initialization file. In remote mode no environment check is made before DAJ starts. Commands are executed on the remote machine via ssh invocation of the remote_daj script. In the case of JDF submission to SAMGrid, the JDF file is first copied to the directory on the remote machine where the remote_daj script resides, according to the value of the remote_daj variable specified in the initialization file. Then a check for a valid grid proxy is made on the remote machine. If one isn't found, the user is prompted for the Grid pass phrase of the user's grid identity or the Kerberos password for a ticket in the FNAL.GOV realm so that a proxy can be obtained. A command is then issued to the remote machine to submit the JDF file.

The remote_daj script needs to be tailored to the system on which it resides. There are configuration variables in the script that can be modified to accomplish the tailoring. These variables, their defaults, and their meaning are listed below.

For the 'Get Request' and 'Request Audit' features to be functional in remote mode the files Queue.py and request_audit.py (both included with DAJ) must be placed on the remote machine in the same directory. The location of that directory is specified in the initialization file.

Depending on connectivity and load, remote commands may execute much more slowly than in normal mode, so patience on the part of the user is required.

OSG & LCG Job Submission

Job submission to OSG and LCG sites can be done by specifying osg-ouhep as the station_name for the OSG or ccin2p3-grid1 as the station_name for the LCG, and supplying a grid_resource_requirement_string value. These can be specified via the Create entries of the Jobs Menu, or via the Ignition window. Job submission to OSG requires valid grid credentials in the Fermi myproxy database. See Get Credentials below for more information. OSG job submission requires that the Fermi product vdt v1_3_2_3 be installed though not declared current on the local daj.py machine or at the remote_daj site. Note this is not the version of vdt used for the JIM products.

Resource Pool Job Submission

Jobs may also be sent to an external broker for disposition through the use of resource pools. The only external broker service presently implemented is the ReSS for OSG. The ReSS pool is specified as the grid_resource_requirement_string value in the JDF. The value is constructed by specifying each computing element as the value of the GlueCEInfoContactString keyword and logically OR'ing these together with this syntax:

grid_resource_requirement_string = (GlueCEInfoContactString == CE1) || (GlueCEInfoContactString == CE2) [... || (GlueCEInfoContactString == CEn)]
Where CEi is the i'th computing element in the resource pool which is specified as a standard OSG CE. The station name is given as the OSG station.
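The construction of the value can be sketched as a small helper that joins one clause per computing element with '||', following the syntax shown above. This is only an illustration in Python 3; the function name is hypothetical, and whether the CE values need quoting in a real JDF is not stated here, so they are inserted as given.

```python
def build_requirement_string(computing_elements):
    """Join one GlueCEInfoContactString clause per CE with '||'."""
    if not computing_elements:
        raise ValueError("a resource pool needs at least one computing element")
    clauses = ["(GlueCEInfoContactString == %s)" % ce
               for ce in computing_elements]
    return " || ".join(clauses)


# CE1 and CE2 are the placeholders used in the syntax line above.
value = build_requirement_string(["CE1", "CE2"])
```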

The grid_resource_requirement_string easily becomes cumbersome with just a few sites in the resource pool. To address this the site resource parameter may also be the name of a pool resource object. The resource pool name may be used as the value of the station_name attribute in the JDF form, and is displayed as a job site in the Ignition window. In this case no grid_resource_requirement_string is specified. The pool resource object is defined in the daj_pooldef file. For a resource pool the identifier name in the file is of the form: name@pooltype;ce1,ce2,...,cen. Where cei has the form gate.keeper.address:port/jobmanager-type e.g.

ress1@resspool;grid1.oscer.ou.edu:2119/jobmanager-lsf,osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor 2
Each cei must be a known resource defined in sites.py and be appropriate to the type of pool. Only the resource pool type resspool, for OSG sites, is implemented. In pool definitions a line is continued if its last character is a backslash (\), e.g.
ress1@resspool;grid1.oscer.ou.edu:2119/jobmanager-lsf,\
               osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor 2
The daj_pooldef file is read at program startup and may be edited and reread by invoking the 'Read Pool Defs' item in the File menu. The defined resource pools may be inspected by invoking the 'Show Pool Defs' item in the Settings menu. See the file daj_pooldefs.template for an example.
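As an illustration of the name@pooltype;ce1,ce2,...,cen form and the backslash continuation rule, the following Python 3 sketch joins continued lines and splits one definition into its parts. It is not DAJ's reader: the function names are hypothetical, leading white space on continuation lines is assumed to be insignificant, and the trailing "2" that appears in the examples above is left out because its meaning is not described here.

```python
def join_continued_lines(text):
    """Merge lines whose last character is a backslash with the next line."""
    logical, pending = [], ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            pending += raw[:-1].strip()
            continue
        logical.append(pending + raw.strip())
        pending = ""
    if pending:
        logical.append(pending)
    return [line for line in logical if line]


def parse_pool_definition(line):
    """Split 'name@pooltype;ce1,ce2' into (name, pooltype, [ce1, ce2])."""
    name, _, rest = line.partition("@")
    pool_type, _, ce_list = rest.partition(";")
    elements = [ce.strip() for ce in ce_list.split(",") if ce.strip()]
    return name.strip(), pool_type.strip(), elements


POOL_TEXT = (
    "ress1@resspool;grid1.oscer.ou.edu:2119/jobmanager-lsf,\\\n"
    "               osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor\n"
)

pools = [parse_pool_definition(line) for line in join_continued_lines(POOL_TEXT)]
```

A check that each element is defined in sites.py and suits the pool type would follow this step; it is omitted because the layout of sites.py is not described here.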

DAJ Window

The DAJ main window has a message area and menubar. Messages are reported here from the main window, JDF editor, and SAM portal.

Available menus are File, Jobs, Tools, Settings, and Help. The Help menu has entries related to all functions of DAJ. Choosing an entry opens the topic on the help page in a web browser. Links to the MonteCarlo Production page, SAM Dataset Definition Query page, and SAMGrid Monitoring pages are also provided. The Settings menu allows customization of user settings. The menu items are discussed below.

File Menu

Load JDF
Loads an existing JDF from disk for submission, editing, or cloning. The JDF read from disk is opened in the appropriate JDF editor form.
Read Pool Defs
Reads the resource pool definition file whose name is given as the value of the pool_def preference key in the dajrc initialization file.
Save messages
Saves messages in the Message window to a time stamped file.

Jobs Menu

Ignition
Opens a window with a list from which the user can choose a site or defined resource pool for job execution. The selection, if the Set button is activated, sets the proper station_name and grid_resource_requirement_string attributes (if necessary) as the default values. This overrides the initial values if set in dajrc. Activation of the Go button starts a production job. This has the same effect as selecting the Auto > Monte Carlo entry of the Jobs menu.

The user may also specify a request to start in the entry box of the Ignition window. If a request is specified, an initial production job for the request is constructed and submitted. This bypasses getting the next request via Queue.py and therefore does not change the request status. It should only be used when appropriate and is not the usual way to process a request. If no request is specified in the Ignition window, the next request from Queue.py is started. When daj_daemon.py is running, the Ignition window (select site, activate Go) is the only feature the user needs to process a request. When daj_daemon.py is not running, the Ignition window and the Recover entry of the Jobs menu are the only features the user needs to process a request.

Recover
Submits a recovery job for a request Id entered by the user. The request Id must have job info in the jobs database. This is the default for jobs submitted by DAJ. The information in the jobs database and the request details in SAM enable the determination of which type of recovery job is necessary, production or merge, and the creation and submission of the JDF to run the recovery job.
Auto
Presents a sub-menu to choose a jobtype for single click job submission. If there is a problem with single click job submission a JDF Window is opened for manual intervention.
Monte_Carlo
When activated a thread starts that will reserve the next request using Queue.py, retrieve request information from SAM, build a JDF for a dzero_monte_carlo job based on the request info and user settings specified in the dajrc initialization file, create a jobfiles_dataset if a suitable one does not exist, and submit the job. For this to work the user must supply some information in the dajrc init file. The minimum two pieces of information are d0runjob version used in the jobfiles_dataset and the station_name to be used in the JDF. For submission to OSG sites the station_name must be osg-ouhep. For submission to LCG sites the station_name must be ccin2p3-grid1. For both OSG and LCG the proper grid_resource_requirement_string must be supplied in the dajrc file or set by using the Ignition window. Valid grid credentials are also required.

The thread will construct by default a jobfiles_dataset name according to the template:
sg_<d0 release version>_d0r<d0runjob version>_cf<cardfile version>
For example sg_p20.08.02-v3_d0r07-07-02_cfv01-00-10. The cardfile version and d0 release are obtained from the request info. The d0runjob version may be specified by the user in the dajrc file with keyword runjob_version. The request info only gives the release version but not the tarball version in SAM to use. To specify a particular tarball version the d0rel_tarball_version keyword may be specified in the dajrc init file. The value of the d0rel_tarball_version keyword may be a white space separated list of different d0 release tarballs. The runjob_version keyword likewise specifies a tarball version (not a list) for example:

runjob_version = 06-05-10_v2
d0rel_tarball_version = p17.09.06_v2 p17.09.01_v9
Note that normally there is no need to specify the d0rel_tarball_version and runjob_version keywords because these values are obtained from a central location.

When constructing a jobfiles_dataset if necessary these assumptions are made:
release tarballs begin with 'd0_MC_'; d0runjob tarballs begin with 'd0runjob_'; cardfile tarballs begin with 'cardFile_'; version numbers follow these prefixes, and '.tar.gz' follows the version.
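
The naming template and the tarball assumptions above can be sketched in a few lines of Python 3. This is an illustration only, not the code DAJ runs; the function names are hypothetical, and the release tarball name used in the usage lines is made up to fit the stated prefix.

```python
TARBALL_PREFIXES = {
    "d0_release": "d0_MC_",
    "d0runjob": "d0runjob_",
    "cardfile": "cardFile_",
}
TARBALL_SUFFIX = ".tar.gz"


def jobfiles_dataset_name(d0_release, d0runjob, cardfile):
    """Fill in sg_<d0 release>_d0r<d0runjob>_cf<cardfile>."""
    return "sg_%s_d0r%s_cf%s" % (d0_release, d0runjob, cardfile)


def classify_tarball(file_name):
    """Return (kind, version) for a recognised tarball name, else None."""
    if not file_name.endswith(TARBALL_SUFFIX):
        return None
    stem = file_name[:-len(TARBALL_SUFFIX)]
    for kind, prefix in TARBALL_PREFIXES.items():
        if stem.startswith(prefix):
            return kind, stem[len(prefix):]
    return None


# Reproduces the example dataset name given in the text.
name = jobfiles_dataset_name("p20.08.02-v3", "07-07-02", "v01-00-10")

# 'd0_MC_p17.09.06_v2.tar.gz' is a hypothetical file name.
release = classify_tarball("d0_MC_p17.09.06_v2.tar.gz")
cards = classify_tarball("cardFile_v00-07-00.tar.gz")
```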

Information about the job submission is written to a persistent database named jobsdb whose location is determined by the jobsdb_dir keyword in the dajrc init file and defaults to "./". The database consists of keyword-value pairs in which the keyword is the request ID in string form and the value is a list of tuples. Each tuple corresponds to a grid job for the request ID. The data are used to construct the merge JDF (see below) as well as for bookkeeping. The script jobsdb.py included with DAJ will dump the contents of the jobsdb to stdout.
Merge
When activated a thread starts that prompts for the request ID to merge, then constructs a merge job JDF from the information in the jobsdb database for that request including the jobfiles_dataset, station_name, and d0_release_version, and then submits the job. A merge_dimension_query is specified in the JDF of the form "appl_name d0reco and data_tier unmerged-thumbnail and global.requestid=" as when the request ID is entered in the "merge_id_xor_dataset" attribute of the merge job JDF window.
Create
Presents a sub-menu to choose a JDF jobtype to create. Available jobtypes are dzero_monte_carlo, dzero_merge, and structured. A structured job by default is a dzero_monte_carlo job followed immediately by a dzero_merge job which operates on the output thumbnails of the dzero_monte_carlo stage of the structured job. A chosen entry will open the JDF editor window to create a new JDF. See JDF Window below.
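
The relationship between the jobsdb records described under Monte_Carlo and the merge JDF described under Merge can be sketched as follows. This Python 3 example uses an in-memory dict in place of the persistent database. The tuple layout (job type, jobfiles_dataset, station_name, d0_release_version), the function names, and the sample values are all assumptions for illustration; the real layout of a jobsdb tuple is not documented here.

```python
MERGE_QUERY_PREFIX = ("appl_name d0reco and data_tier unmerged-thumbnail "
                      "and global.requestid=")


def record_job(jobsdb, request_id, job):
    """Append one grid job tuple under the request ID in string form."""
    jobsdb.setdefault(str(request_id), []).append(job)


def merge_attributes(jobsdb, request_id):
    """Collect merge JDF attributes from the recorded production jobs."""
    jobs = jobsdb.get(str(request_id), [])
    production = [job for job in jobs if job[0] == "dzero_monte_carlo"]
    if not production:
        raise KeyError("no production job recorded for request %s" % request_id)
    # Assumption: the most recently recorded production job supplies the values.
    _, dataset, station, release = production[-1]
    return {
        "jobfiles_dataset": dataset,
        "station_name": station,
        "d0_release_version": release,
        "merge_dimension_query": MERGE_QUERY_PREFIX + str(request_id),
    }


jobsdb = {}
# Hypothetical record for request 99999.
record_job(jobsdb, 99999, ("dzero_monte_carlo",
                           "sg_p20.08.02-v3_d0r07-07-02_cfv01-00-10",
                           "osg-ouhep", "p20.08.02"))
attributes = merge_attributes(jobsdb, 99999)
```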

Tools Menu

SAM Portal
Opens the SAM portal to facilitate the completion of the jobfiles_dataset entry in the JDF editor. See SAM Portal below.
Get Request
Starts Queue.py in an xterm. Queue.py reserves the next prioritized request in the D0 MC system. The location of Queue.py defaults to the current working directory. The location may be customized by an entry in dajrc. See Initialization above.
Request Info
Displays Monte Carlo request information in message area via SAM query.
Request Audit
Audits the status of a request by displaying the number of unmerged and merged tmbs existing for the request. For requests with a phase_dataset a consistency check is done on the event intervals of the phase_dataset, and any missing, overlapping, subset, and duplicate event intervals are determined.
For this function to work the file request_audit.py, which is included in the DAJ tarball, must be placed in the same directory as the Queue.py file. This function can take some time to complete if a request with a phase_dataset is specified. The file request_audit.py may be used standalone from the command line.
Fix Remerge
Fixes problems when unmerged tmbs are already merged and when duplicate merged files exist. Audits the status of a request by displaying the number of unmerged and merged tmbs existing for the request. For requests with a phase_dataset a consistency check is done on the event intervals of the phase_dataset, and any missing, overlapping, subset, and duplicate event intervals are determined.
For this function to work the file fix_remerge.py, which is included in the DAJ tarball, must be placed in the same directory as the Queue.py file. The file fix_remerge.py may be used standalone from the command line.
Stop Job
Stops a grid job.
Get Credentials
Obtains grid credentials. The 'Grid proxy' menu entry will try to get a proxy from a grid certificate or Kerberos ticket. This is used for non-OSG job submission. The 'Myproxy' menu entry will try to get a grid proxy and enter it in the myproxy database at Fermilab. The myproxy credentials are used for OSG job submission.
Dump Database
Displays the contents of the jobs database.
Close Request
Sets the SAM status of a request to completed.
Set Request Status
Sets the SAM status of a request.
Handler Editor
View, edit, or create request handlers associated with requests. The request handler attributes, number of events and status, must be set properly before closing a request. See the RHET help page for more details.
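
The interval consistency check described under Request Audit can be pictured with the sketch below. It is not the logic of request_audit.py, which is not shown here. The definitions are assumptions: intervals are inclusive (first, last) pairs, a duplicate is an interval listed more than once, a subset is an interval lying entirely inside a different one, an overlap is a partial intersection, and missing ranges are the parts of the expected range that no interval covers.

```python
def audit_intervals(expected, produced):
    """Classify inclusive (first, last) event intervals against a range."""
    unique = sorted(set(produced))
    duplicates = [iv for iv in unique if produced.count(iv) > 1]

    subsets, overlaps = [], []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if b[0] > a[1]:
                break  # sorted by start, so later intervals cannot touch a
            if b[1] <= a[1]:
                subsets.append((b, a))    # b lies inside a
            elif a[0] == b[0]:
                subsets.append((a, b))    # same start, a is the shorter one
            else:
                overlaps.append((a, b))   # partial intersection

    missing = []
    next_needed = expected[0]
    for first, last in unique:
        if first > next_needed and next_needed <= expected[1]:
            missing.append((next_needed, min(first - 1, expected[1])))
        next_needed = max(next_needed, last + 1)
    if next_needed <= expected[1]:
        missing.append((next_needed, expected[1]))

    return {"duplicate": duplicates, "subset": subsets,
            "overlap": overlaps, "missing": missing}


# Hypothetical request of 1000 events with some problem intervals.
report = audit_intervals(
    (1, 1000),
    [(1, 250), (1, 250), (251, 500), (400, 600), (450, 500), (801, 1000)],
)
```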

Settings Menu

Dajrc
Show, edit, and reload dajrc settings.
Show Pool Defs
Show resource pool definitions.
Reload Sites
Reloads the site information.

JDF Window

The JDF editor window presents the user with a form to complete the attributes of a JDF, and buttons to act upon it. The attributes are color coded: red attributes must be completed, pink attributes are optional depending on user preferences and the type of job, and yellow attributes should normally not be changed. Values initially displayed are defaults. Defaults are built in but may be superseded by the values in an initialization file. The file name of the saved and submitted JDF is specified by the user. The station_name attribute presents a menubutton that lists the execution sites known to DAJ for the appropriate job_type, i.e. runjob sites for monte_carlo, merge, and structured jobs, and other sites for other job types. New sites can be accommodated by typing the site into the entry box. See below for details on the button actions, job attributes, and file name.

Buttons

The buttons act on the JDF that was created or loaded, except the Help button, which has entries related to the JDF editor window. The chosen topic is opened on the help page in a web browser.

Submit
Saves the JDF in the editor and then submits it to SAMGrid. Adds a job entry to the jobsdb database used in single click job submission and bookkeeping.
Save
Writes the JDF in the editor to the file system using the name in the file_name entry box.
Close
Closes the JDF editor without saving.
Reset
Restores attribute entries to the defaults.

Job Attributes

check_consistency = <Boolean value>
This attribute controls the level of consistency checks that are made during grid job submission. The default is true (all checks are made). A value of false causes some checks (e.g. the d0 code version check) to be skipped. Mandatory checks (e.g. if input is from SAM) are still performed.

d0_release_version = <d0 code version>
The version of d0 code that is to be used for producing events for runjob_requestid. The d0 code version should be consistent with the version specified in the jobfiles_dataset.

events_per_file = <number of events per output file>
This attribute states the number of events to be produced per output file (or phase). For example, with events_per_file=250 a grid job of 25,000 events will generate 100 files (for each Monte Carlo phase) containing 250 events each. If unspecified, the number of events per output file depends on the execution site at which the grid job executes.
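
The arithmetic in the example can be written out as a one-line calculation. In this Python 3 sketch the function name is hypothetical, and rounding up so that a partly filled last file counts as a file is an assumption; the text only covers the case where the total divides evenly.

```python
def output_files_per_phase(total_events, events_per_file):
    """Number of output files per phase, rounding a partial file up."""
    if events_per_file <= 0:
        raise ValueError("events_per_file must be positive")
    return (total_events + events_per_file - 1) // events_per_file


# The case from the text: 25,000 events at 250 events per file.
files = output_files_per_phase(25000, 250)
```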

grid_resource_requirement_string = <grid_resource_requirement_string>
Gatekeeper and jobmanager of OSG or LCG resource, e.g. red.unl.edu:2119/jobmanager-pbs. Only needed and allowed for OSG or LCG job submission. OSG submission requires osg-ouhep as the station_name. LCG submission requires ccin2p3-grid1 as the station_name.
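
The rule that this attribute is only needed and allowed for OSG or LCG submission can be expressed as a small validity check. This Python 3 sketch is not DAJ's own validation; the function name and the wording of the messages are made up for the illustration.

```python
GRID_STATIONS = {"osg-ouhep": "OSG", "ccin2p3-grid1": "LCG"}


def check_grid_attributes(station_name, requirement_string):
    """Return a list of problems with the two attributes (empty if none)."""
    problems = []
    is_grid_station = station_name in GRID_STATIONS
    if requirement_string and not is_grid_station:
        problems.append("grid_resource_requirement_string is only allowed "
                        "with station_name osg-ouhep or ccin2p3-grid1")
    if is_grid_station and not requirement_string:
        problems.append("%s submission needs a "
                        "grid_resource_requirement_string"
                        % GRID_STATIONS[station_name])
    return problems


ok = check_grid_attributes("osg-ouhep", "red.unl.edu:2119/jobmanager-pbs")
missing_string = check_grid_attributes("ccin2p3-grid1", "")
```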

jobfiles_dataset = <dataset (snapshot) containing the tar balls>
The jobfiles_dataset is the dataset (snapshot) containing the files that are necessary for executing the request or doing the merging. This dataset typically contains but is not limited to, d0 code tree (e.g. d0_p14.03.02.tar.gz), card files (e.g. cardFile_v00-07-00.tar.gz) and d0runjob code tree (e.g. d0runjob_v07-05-07.tar.gz). Card files are not required to execute merging jobs. If they are present in the dataset, they will not affect the outcome of merging jobs.

merge_id_xor_dataset = <request number to merge or dataset of files to merge>
Monte Carlo request number of thumbnail files to be merged or a dataset name of thumbnail files to be merged. They are mutually exclusive.

minbias_dataset = <dataset containing minimum bias events to be overlaid>
The files containing minimum bias events that are to be overlaid in the digitization phase are specified in this dataset.

monte_carlo_efficiency = <minimum success rate of a montecarlo job>
The minimum success rate of a montecarlo job required before starting a merging job. For example, if monte_carlo_efficiency = 90, a merging job is submitted only if the montecarlo job has produced at least 90% of the events requested; otherwise the montecarlo job is repeated for the remaining number of events.

notify_user = <user email address>
Email address at which the user will be notified when job completes.

notification = <Always,Never,Complete,Error>

monte_carlo_retries = <number of times a montecarlo job is retried>
The number of times a montecarlo job is retried to produce the monte_carlo_efficiency before giving up.
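
How monte_carlo_efficiency and monte_carlo_retries interact can be pictured as a small decision function. This Python 3 sketch is an interpretation of the two attribute descriptions, not the logic of the structured job itself; the function name, the return values, and the bookkeeping of retries already used are assumptions.

```python
def next_action(requested, produced, efficiency_percent,
                retries_used, max_retries):
    """Decide what follows a montecarlo stage: merge, retry, or give up."""
    remaining = max(requested - produced, 0)
    # Integer comparison avoids rounding problems with percentages.
    if produced * 100 >= requested * efficiency_percent:
        return ("merge", 0)
    if retries_used < max_retries:
        return ("retry", remaining)
    return ("give_up", remaining)


# 23,000 of 25,000 events is 92%, which meets a 90% threshold.
first = next_action(25000, 23000, 90, retries_used=0, max_retries=2)
# 20,000 of 25,000 events is 80%, so the remaining 5,000 are retried.
second = next_action(25000, 20000, 90, retries_used=0, max_retries=2)
```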

phase_dataset = <dataset containing the input for a phase in the Monte Carlo chain>
If the request takes the input for a particular phase (typically it's the generation phase) from SAM, then the dataset containing the input is specified through this attribute. During submission consistency checks are made to determine if the dataset specified by the phase_dataset attribute matches the dataset specified in the request details.

phase_dataset_intervals = <comma separated list of event intervals>
The phase_dataset_intervals are the intervals of events you want to process for recovery from the phase_dataset. Relevant only for phase dataset requests and is mutually exclusive with runjob_numevts.
Example: phase_dataset_intervals = 1-250,501-1000,1251-2000
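
A value in this format can be read with a short parser. The Python 3 sketch below is illustrative; the function name is hypothetical, and treating each interval as inclusive at both ends is an assumption.

```python
def parse_phase_dataset_intervals(value):
    """Turn '1-250,501-1000' into [(1, 250), (501, 1000)]."""
    intervals = []
    for item in value.split(","):
        first_text, sep, last_text = item.strip().partition("-")
        if not sep:
            raise ValueError("expected first-last, got %r" % item)
        first, last = int(first_text), int(last_text)
        if first > last:
            raise ValueError("interval %r runs backwards" % item)
        intervals.append((first, last))
    return intervals


intervals = parse_phase_dataset_intervals("1-250,501-1000,1251-2000")
# Assumption: both ends are included, so 1-250 covers 250 events.
total_events = sum(last - first + 1 for first, last in intervals)
```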

runjob_requestid = <monte carlo request number>
The request number which has its details present in the request database. For more information please see
http://www-d0.fnal.gov/computing/mcprod/mcc.html

runjob_numevts = <Number of events to produce for the Request Id>
The number of events to be produced for the Request Id (runjob_requestid). Mutually exclusive with phase_dataset_intervals.

station_name = <stationname>
The station name at which the job will be executed, assuming that the requirements are satisfied. If the user does not define the station name, brokering will determine it from a matching station, which is essentially a random choice. However, the station name may be declared if the user prefers a certain station. In DAJ, jobs may be sent to a defined resource pool by specifying its name here.

File Name

The file_name entry specifies the name of the JDF saved to disk. It initially displays a default entry. The default for a loaded JDF is the loaded file name. Changing the file name and saving will clone the JDF. A save or submit action will overwrite an existing file without warning. For newly created JDFs the default is a template with the job_type as the prefix and .ssb as the extension. By default the file is created in the tmp_dir directory specified in the dajrc file. The default has the marker '%i' in it, which is replaced when saving with the requestid number, resulting in file names of the form dzero_monte_carlo_99999.ssb, for example.
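
The '%i' substitution can be illustrated as follows. This Python 3 sketch assumes the template is job_type, an underscore, the marker, and the .ssb extension, which matches the example name but is not confirmed as DAJ's exact template; the function name and the tmp_dir value are hypothetical.

```python
import os


def default_jdf_path(job_type, request_id, tmp_dir="."):
    """Build the default JDF path, replacing the '%i' marker on save."""
    template = job_type + "_%i.ssb"   # assumed shape of the default template
    file_name = template.replace("%i", str(request_id))
    return os.path.join(tmp_dir, file_name)


# '/tmp/daj' stands in for the tmp_dir value from dajrc.
path = default_jdf_path("dzero_monte_carlo", 99999, tmp_dir="/tmp/daj")
```

A plain string replace is used rather than the % formatting operator so that any other percent signs in a user-edited file name are left alone.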


SAM Portal

The SAM portal window allows the querying of SAM for available jobfiles datasets and displays their contents. The portal will also create jobfiles datasets from the files available in SAM for D0 release, d0runjob, and cardfiles which are displayed in drop down menus. The portal is designed to facilitate completion of the jobfiles_dataset entry box in the JDF window. The widget functions are described below. There is a status area in the lower button frame that displays messages in addition to those in the main window. The portal can also be used to store in SAM cardfile, d0runjob, and D0 release tarballs to facilitate jobfiles dataset creation.

Buttons

The buttons Go, ReScan, New, Store, Close, Help perform various actions. The menu buttons 'd0 release', 'd0runjob', and 'cardfile' present menus of available files in SAM for jobfiles dataset creation.

Go
Initiates searching of SAM for the datasets that match the dataset mask entry box. For example all datasets beginning with jms can be found by having jms% in the dataset mask entry box and clicking Go. The names of the datasets found are listed in the dataset list listbox.
New
Creates a new dataset in SAM with the name found in the dataset name entry box. The contents of the dataset are the files selected using the blue menubuttons and displayed in the dataset files boxes. Typically only the cardfile entry is optional. It is not required if there is no generation phase in the request.
ReScan
Rescans available tarballs in SAM for jobfiles dataset creation. This button may be used after storing tarballs in SAM to have them show up in the lists of available tarballs.
Store
Stores a tarball in SAM. The type menu button specifies the type of tarball to store. The version entry box specifies the version to store for cardfiles and d0runjob. For D0 release the version entry box becomes the file path for the tarball to store. For this to work the machine running daj.py or the remote machine must have the Fermi d0cvs product installed and the user must have access to the D0 cvs code repository. For cardfiles and d0runjob the action of this button checks out of the repository the product and version specified and creates a package directory structure in store_tmp. The directory is then tar'ed and gzip'ed into a tarball which is then stored in SAM. For the D0 release, the tarball must be pre-made, and the path to it given. The stored tarballs may then be used in a jobfiles dataset.
Close
Closes the SAM portal window.
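
The dataset mask used by the Go button can be illustrated with a local filter. The actual search is performed by SAM; this Python 3 sketch only mimics it on a list of names, assuming that '%' matches any run of characters (as in an SQL LIKE pattern) and that every other character matches itself. The function names and dataset names are made up.

```python
import re


def mask_to_regex(mask):
    """Translate a mask such as 'jms%' into an anchored regular expression."""
    parts = [re.escape(part) for part in mask.split("%")]
    return re.compile("^" + ".*".join(parts) + "$")


def filter_datasets(mask, names):
    """Return the names that match the mask, keeping their order."""
    pattern = mask_to_regex(mask)
    return [name for name in names if pattern.match(name)]


# Hypothetical dataset names.
names = ["jms_test", "sg_p20.08.02", "jmsx", "xjms"]
found = filter_datasets("jms%", names)
```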

Dataset List

The results of the search initiated by the Go button are displayed in this list box. Double clicking on a selection in the listbox will initiate a SAM query to display the contents of the dataset. Only files that are known to be in SAM and that are of the four types found in the dataset files menubuttons' lists are displayed.

Dataset Files

The four entries display the contents of a dataset that has been activated by a double click in the dataset listbox. Files are only displayed if they are of the proper type and known to DAJ. The files of the proper type known by DAJ are displayed in the menus activated by the blue menubuttons. These files are determined at program startup by several threads that query SAM for the desired information. A new dataset may be created by selecting from the dataset files menus, specifying a dataset name in the entry box, and clicking the New button.

Dataset Name

Name of dataset to be created by activating the New button.


$Revision: 1.26 $
Joel Snow
Created July 22, 2004
Revised June 18, 2009