2.1 Overview of the SAM-Grid architecture
3 Installation of the SAM-Grid
3.1.4 Summary of the activities as root
3.1.4.2 Open Ports for Incoming TCP connections
3.1.4.3 Enable Automatic Restart of SAMGrid servers at Boot Time
3.1.4.4 Setup the /etc/grid-security and xinetd daemon (Execution Site Only)
3.1.5 Packages and Samgrid Production Release Cuts
3.2.1 Installing and Configuring Condor and Globus
3.2.2 Installing and Configuring the Grid Security Infrastructure
3.2.3 Updating the Grid Security Infrastructure
3.2.4 Get a Service Certificate
3.2.6 Store the SAM Grid Global Constants to the XML database
3.4 Submission Site Installation
3.4.2 Installation of the JIM Broker client
3.4.3 Installing Output retrieval via web
3.5 Execution Site Installation
3.5.3 Setting up durable location (Optional)
3.5.5 Get the list of users authorized to use the resources (gridmap-file)
3.5.6 Install SAM-Grid Globus job-managers and sandboxing mechanisms
3.5.7 Creating the Resource Description
3.5.8 Installing the resource advertisement software
3.6 Monitoring Site Installation
3.6.1 Create site Configuration
5 Modifying the Product Configuration
6 Automating the Maintenance Tasks
6.1 Regular Cleanup and Maintenance Tools
6.1.1 Cleaning up old Globus files and jim sandboxes
6.1.2 Cleaning up CondorG queue for OSG jobs
6.1.3 Cleaning up CondorG queue for Samgrid jobs
6.1.4 Rotate log files daily and archive them Monthly
6.1.5 Relocate condor job spool directories for jim_broker_client
6.2 Automate security setup tasks
6.2.1 Generate gridmapfile for jim_broker_client from the DZero member list in voms
6.2.2 Automatically fetch the latest CA certificate files and update samgrid ca files.
7.1.1 A typical SAM Analysis Job submission
9.1.1 Common JDL Specifications
9.1.2 SAM Analysis JDL Specifications
9.1.4 Monte Carlo JDL Specifications
9.1.5 Merge Job JDL Specifications
9.1.5.2 Mutually Exclusive attributes
9.1.6 Structured Job JDL Specifications
Read this manual sequentially. Along the way, pointers such as "skip to submission site installation" direct you to site-specific sections. If a pointer matches the installation you want, follow it, then continue reading sequentially until the manual marks the end of that site-specific installation.
SAM-Grid is a virtual project whose core is the D0-PPDG group at Fermilab and which includes off-site D0 collaborators under the aegis of various Grid projects. Its mission is to enable fully distributed computing for D0 and CDF by:
· Enhancing SAM as the distributed data handling system of the experiments.
· Incorporating standard Grid tools and protocols.
· Developing new solutions for Grid computing together with Computer Scientists.
Under this mission, the project strives to unite the D0 efforts from the multifarious Grid activities (PPDG, EU DataGrid, GridPP and more), off-site analysis work and other aspirations distributed throughout the D0 collaboration. The two main areas of work are Job Handling (including specification, brokering, scheduling etc.) and Monitoring and Information Services.
The SAM-Grid is a software suite that addresses the globally distributed computing needs of the Run II experiments at Fermilab. The Job and Information Management (JIM) components complement the Data Handling system of the experiments (SAM), providing the user with transparent remote job submission, data processing and status monitoring.
The logical entities of the SAM-Grid consist of
1. Multiple Execution Sites
2. A central Resource Selector[1]
3. Multiple Job Submission Sites
4. Multiple Clients (User Interface) to the Job Submission Sites.
Servers at the Job Submission Sites and at the Execution Sites register with the Resource Selector. Users describe and submit jobs to the Submission Sites via a User Interface, ultimately installed on a laptop. The Submission Sites maintain a spool of jobs that are periodically matched with the available resources. Matches are currently ranked by the Resource Selector according to the number of files of interest to the job that are already present at the Execution Site. Submission Sites are then responsible for reliably dispatching the job to the Execution Site. Typically, Submission Sites will also spool job outputs.
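The ranking rule described above can be illustrated with a minimal shell sketch. This is not the Resource Selector's actual implementation: the function name rank_sites, the file names and the one-name-per-line listing format are all assumptions made for the example. Each site is scored by how many of the files wanted by the job appear in its listing, and the sites are printed best first.

```shell
# rank_sites WANTED LISTING...
#   WANTED:  file with one wanted file name per line (hypothetical format)
#   LISTING: one file per site, named after the site, one cached file name per line
# Prints "<count> <site>" lines, highest count first.
rank_sites() {
    wanted=$1
    shift
    for listing in "$@"; do
        # -F: fixed strings, -x: whole-line match, -c: count matching lines.
        # grep exits non-zero when nothing matches, hence the "|| true".
        count=$(grep -c -F -x -f "$wanted" "$listing" || true)
        echo "$count $(basename "$listing")"
    done | sort -k1,1nr -k2,2
}

# Small demonstration with made-up file and site names.
demo=$(mktemp -d)
printf '%s\n' raw_001 raw_002 raw_003 > "$demo/wanted"
printf '%s\n' raw_001 other_007 > "$demo/site_a"
printf '%s\n' raw_001 raw_002 raw_003 > "$demo/site_b"
rank_sites "$demo/wanted" "$demo/site_a" "$demo/site_b"
rm -rf "$demo"
```

In the demonstration, site_b holds all three wanted files and is therefore listed ahead of site_a, which holds only one.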
Typical resources at the execution site consist of
1. A Local Resource Management system
2. A SAM Station
3. An Information Manager
The Local Resource Management system generally has experiment-specific interfaces[2] and is based on a Batch System; it is responsible for receiving and processing jobs from the Submission Site. The SAM Station is a collection of resources managed by a set of services to satisfy Data Handling requests from individual jobs or other entities, like the Information System or the Resource Selector. It generally manages a pool of disk caches and may be interfaced to a local Mass Storage System. SAM Stations rely on a set of supporting services, some of which are distributed and some central. The Information Manager provides service configuration support and monitoring of status information. Each Site advertises resource availability to the Resource Selector.
A site can join the SAM-Grid in four ways:
NOTE: Make sure to follow the instructions printed out at installation time.
DISCLAIMER: installing any of the JIM packages will drive you through the installation of Globus: the installation will be MUCH easier if the product area is NOT NFS shared. However, below you will find instructions on how to install Globus in this scenario as well.
Since the current focus of the SAM-Grid development is enabling distributed SAM analysis jobs, the discussion below assumes the site runs a SAM station. Please refer to http://d0db.fnal.gov/sam/doc/install/ for instructions tailored to the DZero environment, or to http://cdfdb.fnal.gov/sam/doc/cdf/install/install.html for CDF.
The requirements will vary depending on configuration and custom installation choices.
Memory | 128 MB of RAM (256 MB recommended)
Hard Disk | 1 GB (recommended)
Processor | Intel x86 processor (Pentium II or above recommended)
Linux |
UPS/UPD | >= 4.7. The packaging tool used for the SAM Grid is ups/upd. The installation of Globus will not work if you use an earlier version. If you need to install ups/upd, please go to http://www.fnal.gov/docs/products/ups/ . If ups/upd is installed on your system already, generally you have to source a setup file: /usr/local/etc/setups.(c)sh for typical installation and DZero, ~cdfsoft/cdf2.(c)shrc for CDF.
· Create a local ups product area, where all the SAM-Grid products will be installed. We strongly recommend that this area is owned by user sam: see ftp://ftp.fnal.gov/products/bootstrap/current/index.html#unix_user to create such a product area.
· Create a local user called sam. We also highly recommend creating a user called samgrid to enable generic authorized grid users to run jobs; this is optional, since users can be mapped to individual accounts.
· Create a directory writable by user sam, named e.g. "jim", and initialize the environment variable SAMGRID_LOCAL_DIRECTORY to point to it. This is optional but will make installation easier. SAMGrid products use this area at run time for their activities, including sandboxing.
In order to install the whole JIM software suite, root access is needed for the following actions:
SAMGrid’s servers typically run under, and use files belonging to, the “sam” UNIX account. Thus, an absolute minimum requirement is to have the “sam” account set up. Although it is possible to run the SAMGrid servers under another account, doing so will greatly complicate our support.
In the past, the SAM team also recommended the “products” account for use by the UPS/UPD system. This account exists on nearly all the FNAL systems. For our purposes, we realize that, outside of FNAL, UPS/UPD is installed solely for computing with SAM and therefore a separate user for merely owning the products files is hardly necessary. Moreover, the distinction between SAM and products creates numerous problems with permissions as our servers (especially third-party software) often write files at run-time that belong to “products” unless specifically changed. We therefore strongly recommend installing and maintaining products as user “sam”.
For an execution site, depending on your local policies, you need to authorize off-site users (off-site relative to your site, not FNAL) to execute jobs. Note that, by definition, this is required for your site to be part of the Grid. You may choose to map external authorized users to the local “sam” account (which might interfere with the SAMGrid server operation) or to another group account such as “samgrid”.
Opening ports in the firewall from the head node (NB: SAMGrid does NOT require direct connectivity between worker nodes and the Internet):
grid gatekeeper: (execution site only) 2119 Open to all Submission Sites. See the Section on the Architecture for the definition and http://samgrid.fnal.gov:8080/ for the list of the currently known submission sites. “Open to the world” would enable us to add new submission sites without changing the configuration of all the execution sites.
job-managers: (execution site only) Any contiguous range of N ports, also open to the Submission Sites, where N is the number of concurrently running Grid jobs (a Grid job is “running” if it has been submitted to your local batch system). We recommend a number on the order of 100. The same consideration as above applies for “open to the world”. For the gatekeeper to use this port range, it needs to be started (e.g. via xinetd) with the environment variable GLOBUS_TCP_PORT_RANGE=50001,50100 (example).
condor_schedd: (submission site only) Any contiguous range of M ports, where M is the maximum number of Grid jobs concurrently submitted through your site. Open to all Client machines authorized to use your submission site. (If all the authorized client machines are behind the same firewall, you do not need to open any of these ports.) Add to the $CONDOR_CONFIG file of jim_broker_client the macros LOWPORT=port1 and HIGHPORT=port200.
grid MDS: (monitoring site only) 2135 Open to samgrid.fnal.gov or, better, to all of FNAL to enable possible failover mechanisms.
tomcat: (all site suites except client) 7080 Open to samgrid.fnal.gov; enables configuration management via the XML Database and retrieval of job output by the users. (submission site only) 7081 GSI-secured door (optional), open to samgrid.fnal.gov and all the client machines; provides secure job cancellation by the users.
If the site runs a SAM station, these are the ports that need to be opened:
sam: 4550-4555 Open to FNAL. This is required for CORBA callbacks by SAM servers. At an absolute minimum, the list should include d0mino.fnal.gov (or any other D0 FNAL data router station) and d0db[-dev].fnal.gov for D0, and cdfdb.fnal.gov for CDF. Use the option --OAport=portNum to define the port on which a given SAM server listens.
sam_dcache_cp: (CDF only) 25126 and 2811 Mainly to cdfdca.fnal.gov (for access to the CDF DCache system). D0 dcache systems to come soon. See also sam_gridftp client.
sam_gridftp server: 4567 (control) + any contiguous range of K ports (data), open to all sites that will be allowed to pull data out of your site. NB: These should include the head nodes of all the SAM stations if you want to be considered part of the Grid!
sam_gridftp client: Any contiguous range of K ports for data, where K is the number of simultaneous transfer streams initiated by your site. It must be open to all sites from or to which your site will pull or push data (at a minimum, d0mino.fnal.gov for D0). This number must also match the number of parallel transfers set in the external SAM stager.
sam_bbftp server: (deprecated by grid_ftp). Open 14021 as described under the sam_gridftp server.
sam_bbftp client: (deprecated by grid_ftp) All ports must be open to d0mino.fnal.gov (D0) and other sites where your site will push data.
More information on the requirements that the Globus Toolkit places on firewalls is available at http://www.Globus.org/security/v2.0/firwalls.html
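As a small worked example of the port arithmetic above, the following sketch derives the value of GLOBUS_TCP_PORT_RANGE from a starting port and the number N of concurrent Grid jobs. The helper name port_range is hypothetical; only the variable GLOBUS_TCP_PORT_RANGE and the example range 50001,50100 come from this manual.

```shell
# port_range LOW COUNT: print "low,high" for a contiguous range of COUNT ports
# starting at LOW. Hypothetical helper, not part of any SAM-Grid product.
port_range() {
    low=$1
    count=$2
    if [ "$low" -lt 1 ] || [ "$count" -lt 1 ]; then
        echo "port_range: LOW and COUNT must be positive" >&2
        return 1
    fi
    high=$((low + count - 1))
    if [ "$high" -gt 65535 ]; then
        echo "port_range: range ends above port 65535" >&2
        return 1
    fi
    echo "${low},${high}"
}

# 100 ports starting at 50001, as in the example in the text.
GLOBUS_TCP_PORT_RANGE=$(port_range 50001 100)
export GLOBUS_TCP_PORT_RANGE
echo "GLOBUS_TCP_PORT_RANGE=${GLOBUS_TCP_PORT_RANGE}"
```

The same range must then be opened in the firewall; the helper only keeps the two ends consistent with the chosen N.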
The exact means vary and depend on the local administrator’s preferences. A typical way is to modify /etc/rc.local so that it includes a line similar to this:
su sam -c /home/sam/samgrid_start.sh
Also see the Section on server start-up.
See Sections on configuring GSI and installing Globus gatekeepers (a.k.a. resource manager bundle).
The requirements of other packages are driven by the type of configuration you choose and are listed in their respective sections. For each type of installation we have laid out the list of packages below.
You can find the latest Samgrid production cut at http://www-d0.fnal.gov/computing/grid/releases/
This section covers the general installation procedures required by all site installations, unless otherwise specified.
The SAM-Grid uses the Condor and Globus middleware distributed by the Virtual Data Toolkit. The VDT product in ups is a wrapper around pacman: the software comes from the official VDT web site.
It is important that there is no variable in the environment that points to other instances of Globus while installing this new instance. You can check e.g. if GLOBUS_LOCATION or GPT_LOCATION are already defined or that PATH includes paths to other installations of Globus. In that case, check e.g. ~/.shrc and /etc/profile (or similar environment bootstrapping files) to eliminate such definitions during the installation phase.
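The check described above can be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_globus_env is hypothetical, and the PATH test is only a heuristic that looks for the word "globus" in each directory name.

```shell
# check_globus_env: report environment settings that point at another
# Globus installation. Returns 0 if nothing suspicious is found, 1 otherwise.
check_globus_env() {
    status=0
    for var in GLOBUS_LOCATION GPT_LOCATION; do
        # Indirect lookup of the variable named in $var.
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            echo "WARNING: $var is set to $value"
            status=1
        fi
    done
    # Heuristic: flag PATH entries whose name mentions globus.
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        case "$dir" in
            *[Gg]lobus*)
                echo "WARNING: PATH contains $dir"
                status=1
                ;;
        esac
    done
    IFS=$old_ifs
    return $status
}

if check_globus_env; then
    echo "No other Globus installation found in the environment."
fi
```

Any warning printed here points at a definition to remove from the shell start-up files before installing.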
Product | VDT
Install as | sam
Install operation | upd install VDT -G-c
Tailor as | sam or root (see below)
Tailor operation | Before tailoring, make sure that your system has:
1. the “patch” command
2. “gcc” (the appropriate version for your Linux distribution)
3. a recent version of tar: v1.13.12 or newer.
More info at http://www.cs.wisc.edu/VDT/
as user sam: $ ups tailor VDT
as user root: $ ups InstallAsRoot VDT
Notes:
· Because tailoring is CPU and I/O intensive, beware that
1. On some systems this command can take 30 min.
2. Installations on NFS-mounted disks can run into I/O-related problems.
· The script executed as root changes the xinetd config files and restarts the xinetd daemon.
· At the end of the installation, the location of the installation log will be printed out. Look at it for potential problems.
Notes for experts:
· To change the default location of the gatekeeper gass cache, add this line to the xinetd configuration file:
· To let the gatekeeper know what ports are open in your firewall to run the job-managers, add something like this line to the xinetd configuration file: env = GLOBUS_TCP_PORT_RANGE=50001,50100
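The tailoring prerequisites for VDT listed above (the patch command, gcc, and tar v1.13.12 or newer) can be verified before running ups tailor. The sketch below is one way to do it, assuming GNU tar, whose --version output ends its first line with the version number; the helper name version_ge is hypothetical.

```shell
# version_ge A B: succeed if dotted numeric version A is >= B.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Report which of the required tools are on the PATH.
for tool in patch gcc tar; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool"
    else
        echo "MISSING: $tool"
    fi
done

# GNU tar prints e.g. "tar (GNU tar) 1.34" on the first line of --version.
tar_version=$(tar --version 2>/dev/null | head -n 1 | awk '{print $NF}') || tar_version=""
if [ -n "$tar_version" ] && version_ge "$tar_version" 1.13.12; then
    echo "tar $tar_version is recent enough"
else
    echo "tar version '$tar_version' is older than 1.13.12 or could not be determined"
fi
```

The script only reports; whether the installed gcc is appropriate for your Linux distribution still has to be judged by hand.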
This product configures the Globus Security Infrastructure of your system.
Product | sam_gsi_config
Install as | sam
Install operation | upd install sam_gsi_config -q VDT -G-c
Tailor as | sam or root (see below)
Tailor operation | $ ups tailor sam_gsi_config -q VDT
The tailoring procedure configures GSI for various SAM-Grid products. You will be asked which products you want to configure GSI for. If you do not know, configure it for all of them. The script will print out which user(s) need to execute the command below. Typically, you need to execute it as user sam and as root (for execution site installation):
$ ups install_ca sam_gsi_config -q VDT
If you are installing either a Client site or a Monitoring site, please skip to the site-specific installation. Installers of Submission and Execution sites should read on.
Skip to Monitoring Site Installation
This paragraph describes what to do when a CA certificate has expired and needs to be replaced. It assumes a working sam_gsi_config installation. Also, you must know the fingerprint string of the expired CA.
Product | sam_gsi_config
Update as | products and/or sam and/or root (see later)
Update operation | If your sam_gsi_config installation is older than v2_0_8, first do
$ ups update_config sam_gsi_config -q VDT
Update a CA certificate as:
$ setup sam_gsi_config -q VDT
$ sam_gsi_install_ca --fingerprint=<fingerprint_hash>
where fingerprint_hash is a string of the form e1fce4e9.
Instructions on which other users should execute this command will be printed on the screen. To force the installation as a user different from the one recommended by sam_gsi_config, add the option --force-user
Request a SAM service certificate from the DOEGrids CA. Using a CA other than DOEGrids may be fine: please send email to cdfsam-admin@fnal.gov or d0sam-admin@fnal.gov.
If you are installing an execution site, you will also need to get a host certificate: you may want to get it now. Follow instructions at Get a host certificate.
As user | sam
Operations | $ setup sam_gsi_config -q VDT
$ sam_cert_request
Follow the instructions on the screen.
Notes: The command above will guide you through the request for a SAM service certificate (typically 1 day response). When you receive your signed certificate by email, save it as is in the location printed on the screen and make it owned by user “sam”.
More detailed instructions for the installation of a SAM service certificate for sam_gridftp are at http://d0db.fnal.gov/sam/doc/install/fileTransfer.shtml#sam_gridftp
This is an xml database server. It is currently implemented using the Xindice database and is used within the SAM-Grid as the interface that the Grid and the Fabric use to exchange information. Its main function is to store product and resource configurations.
Install the following packages (Tomcat and xmldb_server) on a single machine at your site. It can be a submission machine, an execution machine or an independent machine, but we recommend installing them at the submission site if you need output retrieval via the web.
The installation of Tomcat is optional if you have another Servlet runner. Tomcat is used as a servlet engine within SAM-Grid to run xmldb_server servlet.
Product | tomcat
Install as | sam
Install operation | upd install tomcat -G-c
Tailor as | sam
Tailor operation | ups tailor tomcat
Notes: Defaults are fine. The product area where tomcat is installed must be owned by user “sam”. If you have installed this server as products for special reasons, change the ownership from “products” to “sam” (e.g. if you have root access), or execute “ups chown tomcat”.
Start as | sam
Start operation | ups start tomcat
Product | xmldb_server
Install as | sam
Install operation | upd install xmldb_server -G-c
Tailor as | sam
Tailor operation | ups tailor xmldb_server
Configuration example:
Configuration Parameters:
webapps_directory: Enter the directory used by your servlet engine to store the servlets.
db_location: Enter the directory used by the database to store the documents.
db_name: Enter the name of the xml database; this name is used when querying the database. Use the default 'db'.
run_command: Enter the command that starts up your servlet engine.
stop_command: Enter the command that stops your servlet engine.
Notes: We have observed corruption in xmldb whenever the disk storing the DB files gets full. The only way to recover is to clean up the database files and start from scratch, so take this into account when choosing db_location. The disk space used by xmldb grows with every local job running at the site, and the growth is nonlinear, so there is no good metric for estimating the disk space required. It is the responsibility of the users to make sure that the machine does not run out of disk space.
Start as | sam
Start operation | ups run xmldb_server &
Notes: YOU NEED TO RUN THE COMMAND NOW if you plan to use this database for the configuration of other products (recommended). Refer to the Section on starting up the servers for instructions to run all servers.
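Because a full disk corrupts the xmldb files, it may help to check the file system that holds db_location on a regular basis, for example from cron. The following is a minimal sketch, not part of xmldb_server: the function names, the DB_LOCATION variable and the 90% threshold are assumptions chosen for the example. It relies on the portable output format of df -P, where the fifth column of the second line is the used capacity.

```shell
# disk_usage_percent DIR: print the used percentage (without the % sign)
# of the file system that holds DIR.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_db_disk DIR LIMIT: warn and return 1 when usage reaches LIMIT percent;
# return 2 when the usage cannot be determined.
check_db_disk() {
    used=$(disk_usage_percent "$1") || return 2
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is ${used}% full (limit ${2}%)"
        return 1
    fi
    echo "OK: $1 is ${used}% full"
}

# DB_LOCATION is a hypothetical variable standing for the db_location
# directory chosen during tailoring; it falls back to / for the demonstration.
check_db_disk "${DB_LOCATION:-/}" 90 || echo "Free some disk space before xmldb fills the disk."
```

The non-zero return codes make the function easy to combine with a mail or alerting command of your choice.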
Install the following software on both submission and execution sites.
Product | xmldb_client
Install as | sam
Install operation | upd install xmldb_client -G-c
Tailor as | sam
Tailor operation | ups tailor xmldb_client
Configuration example:
<xmldb_client>
  <interview_schema_version version="1_0"/>
  <xmldb_server url="http://samgfarm4.fnal.gov:7080/Xindice"/>
</xmldb_client>
Configuration Parameters:
url: Enter the xml db server for your site. If this is the machine that runs the xml db server, accept the default; otherwise enter the correct address.
Enter the default xml db server URL ( typical form http://my.db.host:7080/Xindice ):
What is the url of the xmldb_server ? [http://samham.fnal.gov:7080/Xindice]:
The attribute url is set to the 'http://samham.fnal.gov:7080/Xindice'
Product | jim_config
Configure as | sam
Configure operation | $ ups store_constants jim_config
Notes: This will store global constants such as the SAM IOR, the broker location and the DB server name in the database.
Skip to Submission Site Installation
Skip to Execution Site Installation
The Client site is where you submit your job to the Grid. It is a very lightweight component: installing jim_client alone is sufficient.
Product | jim_client
Install as | products
Install operation | upd install jim_client -G-c
Tailor as | products
Tailor operation | ups tailor jim_client [-q <qualifier if allocated by “new” command>]
Configuration example:
<jim_client_configuration>
  <interview_schema version="1_3"/>
  <condor_config_parameters>
    <uid_domain domain="fnal.gov"/>
    <schedd_host hostname="samgrid.fnal.gov"/>
    <condor_host hostname="samgrid.fnal.gov"/>
    <network_interface>
      <public_interface ip="131.225.167.1" />
    </network_interface>
    <structured_jobs structured_jobs="no"/>
  </condor_config_parameters>
  <MyProxy_Server hostname="fermigrid4.fnal.gov"/>
</jim_client_configuration>
Configuration Parameters:
uid_domain: Enter your domain.
schedd_host: Enter the hostname of the submission site.
condor_host: Enter the hostname of the jim_broker. Use the default.
public_interface, network_interface: Enter the IP address of your system that you want to use. If you have multiple network interfaces, use the IP address of the interface that is accessible from outside your local network. To list the interfaces on your system, run /sbin/ifconfig in another window.
structured_jobs: Enter whether you want to run structured jobs. Answer 'no' here.
MyProxy_Server: Enter the address of the MyProxy_Server. Use the default.
Notes: You can ignore warnings about xmldb_client: by default, the JIM configuration manager will try to store this configuration into an xml database; this is not required for jim_client and the automatic FS storage is sufficient.
Creating new environment | ups new jim_client
The ups command "new" creates and declares a qualifier name which can be used to tailor and store multiple, separately accessible jim_client configurations. The command prompts for user input and carries out the steps needed to declare a new instance of the product in the ups database. The newly declared product needs to be tailored in the same way as the non-qualified version outlined in the previous step, with the qualifier name given explicitly, i.e. ups tailor jim_client -q <new qualifier>. The new environment will be available after setting up jim_client with “setup jim_client -q <new qualifier>”.
Congratulations. You may start submitting your job if your submission site is configured.
End of Client site Installation!
Make sure you have followed the middleware installation instructions at paragraph 3.2; in particular you need to install condor and Globus, configure GSI, request a service certificate, install the XML database and store the global SAMGrid constants into it.
Product | jim_broker_client
Install as | sam
Install operation | upd install jim_broker_client -G-c
Tailor as | sam
Tailor operation | $ ups tailor jim_broker_client
Configuration example:
<jim_broker_client_configuration>
  <interview_schema version="1_6" />
  <condor_config_parameters>
    <uid_domain domain="fnal.gov" />
    <local_dir dir="/data/jim" />
    <spool_dir dir="/data1/jim" />
    <condor_host hostname="samgrid.fnal.gov" />
    <condor_admin_email email="parag@fnal.gov" />
    <network_interface ip="131.225.110.153" />
    <broker_identity subject="/DC=org/DC=doegrids/OU=Services/CN=sam/samgrid.fnal.gov" />
    <condor_lowport_highport lowport_highport="49152,65535" />
    <site_name site_name="samgrid.fnal.gov" />
  </condor_config_parameters>
</jim_broker_client_configuration>
Configuration Parameters:
uid_domain: Enter your domain.
local_dir: Enter the full path where you want to store your log files for the JIM suite. It must be a local path. A directory called jim_broker_client will be created automatically inside this local_dir when you first start the scheduler. Logs and the gridmapfile pertaining to the JIM broker client installation will be stored here. User ‘sam’ should have write access to this directory.
spool_dir: Enter the full path where you want to store your spool files for the JIM suite. It must be a local path. A directory called jim_broker_client will be created automatically inside this spool_dir when you first start the scheduler. The spool area stores the input and output sandboxes for the JIM broker client. User ‘sam’ should have write access to this directory.
condor_host: Enter the hostname of the broker.
condor_admin_email: Enter the administrator email address for this installation.
public_interface, network_interface: Enter the IP address of your system that you want to use. If you have multiple network interfaces, use the IP address of the interface that is accessible from outside your local network. To list the interfaces on your system, run /sbin/ifconfig in another window.
broker_identity: Enter the certificate subject of the Broker. Use the default.
condor_lowport_highport: Enter the range of port numbers on which you want the condor processes to run (e.g. 50101,50120). Note that this is important if the schedd node is behind a firewall.
site_name: Enter the Site Name. This is the name which will appear in the ClassAd of the schedd and will be displayed on the web.
Notes: Define the variable SAMGRID_LOCAL_DIRECTORY, as explained in Section System Configuration, to get sensible defaults.
You will be asked several questions. Choose a directory for the job spooling area: user “sam” will write the input sandbox and other files to this area on behalf of the user's job. A good location is the samgrid local area (where the local ups dir generally is as well) in a directory called "jim". AFTER tailoring you will need to chown -R this area to user sam. Do not change the defaults of the other questions unless you are absolutely sure of what you are doing.
Other Useful Tasks:
· Generate the gridmapfile from voms. Refer to the Section “Automating the Maintenance Tasks”.
The following tasks are no longer supported.
· To add a new user to your Submission site, execute
$ ups AddUser jim_broker_client
NOTE: you can add users ONLY AFTER you have successfully started jim_broker_client once.
· Optionally, if you are an advanced user and want to add multiple users at the same time, first create an input_file with a list of Grid subjects and execute
$<jim_broker_client_prod_dir>ups/gridmap_gen.py <input_file >>condor_schedd_gridmap_file
Run as | sam
Run operation | ups run jim_broker_client &
Notes: Refer to the Section on starting up the servers for instructions to run all servers. DO NOT DO THIS STEP UNTIL YOU HAVE INSTALLED THE SAM SERVICE CERTIFICATE.
This is an optional package for users who prefer to retrieve their output from a web page after the job is completed.
· Make sure your servlet runner is installed and configured properly in your submission site. You may optionally install our distribution of tomcat.
· Install jim_www_sandbox servlet
Product | jim_www_sandbox
Install as | sam
Install operation | upd install jim_www_sandbox -G-c
Tailor as | sam
Tailor operation | ups tailor jim_www_sandbox
Configuration example:
<jim_www_sandbox_configuration>
  <interview_schema version="1_1"/>
  <nonsecure_services url="http://samgrid.fnal.gov:7080" directory="/data/products/ups/db/tomcat/webapps"/>
  <secure_services url="https://samgrid.fnal.gov:7081" directory="/data/products/ups/db/tomcat/secureapps"/>
  <jim_out_sandbox servlet_secure="no"/>
</jim_www_sandbox_configuration>
Configuration Parameters:
nonsecure_services, directory: Enter the directory used by your servlet engine to run non-secure servlets.
secure_services, directory: Enter the directory used by your servlet engine to run secure servlets.
Notes: You will be prompted to enter the location where the servlet runner is installed on your machine. After tailoring, you may need to restart the servlet runner.
· If you are using another distribution of tomcat or another servlet runner, you may need to do the following additional step.
o Verify that the Broker client is configured properly and that its environment is accessible to the servlet runner during startup, i.e. “setup jim_broker_client” should be run in the same terminal before the servlet runner is started, so that it has access to the Broker client’s environment.
End of Submission site Installation.
Skip to starting up the servers.
IMPORTANT: Before you proceed with the usual “upd install / ups tailor” routine, be sure to read, understand, and execute instructions from the document describing the grid to fabric job submission interface. This job submission is the core of the execution site installation (even though this is merely 1 or 2 packages out of 20 or so total for the execution site) and has historically caused the most questions and problems. Please do not install the rest of the execution site if the local job submission is not working!
Make sure you have followed the middleware installation instructions at paragraph 3.2; in particular you need to install Condor and Globus, configure GSI, request a service certificate, install the XML database and store the global SAMGrid constants into it.
See http://d0db.fnal.gov/sam/ for DZero and CDF instructions.
Install the latest version of SAM and declare it current. Refer to the Samgrid release cuts at http://www-d0.fnal.gov/computing/grid/releases/ Also make sure that "setup SAM -q d0_prd" (or -q cdf_prd) sets up the latest version (check for example $SAM_DIR). If this is not the case declare the previous versions "old".
We recommend that the installation of the JIM products be done on a separate ups database owned by user sam: this is the only set of products needed by the JIM software. On the other hand, the SAM software is generally installed on a product area owned by user products or cdfsoft. The JIM execution site software will need access to a few SAM products (see below): we found it convenient simply to install and configure them again in the JIM ups database:
· sam client: the code and the configuration. By default, the JIM software will execute “setup SAM -q d0_prd” (or cdf_prd) to get the SAM client environment.
· sam_cp_config: needs to be configured for intra-cluster transfers. Typically jim_gridftp or fcp is used. You can add the line '.' : [ 'jim_gridftp', ], to your domain capability map.
You may use a durable location set up at a different or central site, or set one up on site. Samgrid jobs use the durable location to store production files before they are merged and finally stored to tape. To set up a durable location, refer to the latest Samgrid release cut at http://www-d0.fnal.gov/computing/grid/releases/ and install the packages listed under “Middleware packages”, “Sam client packages”, “Sam Station packages” and jim_gridftp. If the durable location is on a machine that acts as a Samgrid head node or station node, most of these packages should already exist. If not, please refer to the individual package installation and configuration instructions. Once the installation is complete, register the location with SAM by sending an email to the SAM shifters with the name of the machine, the path to the storage and the disk size of the storage. Configure the local_storage in the site configuration to use the durable location; refer to Site configuration for more details. If you need to configure multiple durable locations, please refer to the documentation on configuring a complex site with application-specific queues and storages at http://www-d0.fnal.gov/computing/grid/doc/Application-ResourceTuning-01Aug05-cut.pdf
|
As user |
Root |
|
Operations |
You need to request a host certificate from a Certificate Authority (CA) for your gateway node (typically 1 day response). SAM-Grid works mostly with the DOEGrids CA, but other CAs may be trusted as well. Contact d0sam-admin@fnal.gov or cdfsam-admin@fnal.gov for more information.
The GSI security binaries can be made available to your shell via
$ setup VDT
The command to request the certificate from the DOEGrids CA is
$ GRID_SECURITY_DIR=/etc/grid-security grid-cert-request -host `hostname -f` -ca 1c3f2ca8
Follow the instructions at http://www.grid.iu.edu/osg-ra/HostRequest.php. In particular, you need to fill in a certificate request form. The relevant form is at https://pki1.doegrids.org/ca/ under “Grid or SSL Server”.
When you are ready to fill in the form, use “Affiliation” OSG and “Experiment” DZero/CDF; you can mention in the comment that the certificate is for SAM-Grid. |
You need to configure your system with the list of users allowed to run jobs at your resources. This list is called the gridmap-file, as it maps the grid subjects of the users to the local unix accounts that run the jobs. The SAM-Grid has developed a tool that uses sam_gridftp to get an “official” list of users belonging to CDF or DZero. Before running the following commands, make sure your sam_gridftp is installed and working. In particular, RUN THE FOLLOWING COMMANDS ONLY AFTER YOU'VE RECEIVED THE SAM SERVICE CERTIFICATE.
|
Product |
sam_gsi_config_util |
|
Install as |
Sam |
|
Install operation |
upd install sam_gsi_config_util -q VDT -G-c |
|
Run as |
Root |
|
Run Operation |
$ setup sam_gsi_config_util
$ sam_gsi_get_gridmap --gatekeeper --local-user=<user-running-jobs>
Note: this command appends the subjects of the DZero/CDF VO to your local grid-mapfile. If you have an old grid-mapfile, make sure that the mapping of the subjects to the user that runs the job is right before using the tool: otherwise you may end up with the same subject mapped to two different users.
Optional: edit the crontab for root and add something like
0 * * * * . /usr/local/etc/setups.sh && setup sam_gsi_config_util && sam_gsi_get_gridmap --gatekeeper --no-default-gridmap --local-user=samgrid > /dev/null 2>&1
This keeps your grid-mapfile up to date. |
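The warning above about one subject ending up mapped to two different users can be checked before and after running the tool. The following is a minimal sketch, not part of the SAM-Grid tools: it assumes the usual grid-mapfile layout of a double-quoted subject followed by a local user name, and the function name and sample subjects are hypothetical.

```python
import shlex
from collections import defaultdict


def conflicting_subjects(gridmap_text):
    """Return {subject: [users]} for subjects mapped to more than one local user."""
    users_by_subject = defaultdict(set)
    for line in gridmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # keeps the quoted subject DN as one token
        if len(parts) < 2:
            continue  # not a "subject user" mapping line
        users_by_subject[parts[0]].add(parts[1])
    return {s: sorted(u) for s, u in users_by_subject.items() if len(u) > 1}


# Hypothetical sample content, for illustration only.
sample = '''\
"/DC=org/DC=doegrids/OU=People/CN=Example User 1" samgrid
"/DC=org/DC=doegrids/OU=People/CN=Example User 2" samgrid
"/DC=org/DC=doegrids/OU=People/CN=Example User 1" olduser
'''
print(conflicting_subjects(sample))
```

An empty result means no subject is mapped to more than one account. Subjects containing an apostrophe would need a more careful parser than shlex.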
|
Product |
jim_job_managers |
|
Install as |
Sam |
|
Install Operation |
upd install jim_job_managers -G-c |
|
Tailor as |
Sam |
|
Tailor Operation |
ups tailor jim_job_managers
Configuration example:
<jim_job_managers_configuration>
  <interview_schema version="2_1" />
  <ups setup_script="/local/ups/etc/setups.sh" />
  <experiment experiment_name="d0">
    <sam_prd_qualifier sam_prd_qualifier="d0_prd" />
    <sam_dev_qualifier sam_dev_qualifier="d0_dev" />
  </experiment>
  <dzero_monte_carlo>
    <events number_per_output_file="250" />
    <accelerator bootstrap_time="180" interval="90" />
  </dzero_monte_carlo>
  <cdf_monte_carlo>
    <accelerator bootstrap_time="180" interval="90" />
  </cdf_monte_carlo>
  <dzero_merge>
    <accelerator bootstrap_time="180" interval="180" />
  </dzero_merge>
  <dzero_reconstruction>
    <accelerator bootstrap_time="180" interval="300" />
  </dzero_reconstruction>
  <dzero_reco_merge>
    <accelerator bootstrap_time="180" interval="180" />
  </dzero_reco_merge>
  <dzero_tmbfix>
    <accelerator bootstrap_time="180" interval="300" />
  </dzero_tmbfix>
  <dzero_skimming>
    <accelerator bootstrap_time="180" interval="300" />
  </dzero_skimming>
  <local_tmp_area local_tmp_area="/data/jim/jim_tmp/" />
  <polling_interval grid_update_interval="300" xml_update_interval="300" />
</jim_job_managers_configuration>
Configuration Parameters:
accelerator, interval: Time interval in minutes after which the job is killed if there is no disk activity.
events, number_per_output_file: Default, maximum number of events written to an MC output file. The number of events per output file can be overridden by specifying runjob_numevts and events_per_file in the JDL.
local_tmp_area: Temporary area to which jim_job_managers writes files. User ‘sam’ should have write access to the directory.
polling_interval, grid_update_interval
Note: To configure jim_job_managers with application specific queues, please refer to the documentation available from the Samgrid home page. |
|
Product |
jim_sandbox |
|
Install as |
Sam |
|
Install Operation |
upd install jim_sandbox -G-c |
|
Tailor as |
Sam |
|
Tailor Operation |
ups tailor jim_sandbox
Configuration example:
<jim_sandbox_configuration home="/data/jim/jim_sandbox">
  <interview_schema version="3_0" />
  <keep_sandbox compressed="no" />
</jim_sandbox_configuration>
Configuration Parameters:
jim_sandbox_configuration, home: Enter the directory that handles the input sandboxes at the head node. The disk space required is large (typically hundreds of GB). This directory should be writable by user ‘sam’ and by the user running jobs, typically ‘samgrid’.
keep_sandbox, compressed: Do you want to keep the sandbox compressed at the head node? |
|
Product |
jim_gridftp |
|
Install as |
Sam |
|
Install Operation |
upd install jim_gridftp -G-c |
|
Tailor as |
Sam |
|
Tailor Operation |
ups tailor jim_gridftp
Configuration example:
<jim_gridftp_configuration>
  <interview_schema version="2_0" />
  <host name="samgfarm4.fnal.gov">
    <data_server>
      <port number="4568" />
      <certificate subject="/DC=org/DC=doegrids/OU=Services/CN=sam/samgfarm4.fnal.gov"/>
    </data_server>
    <head_server>
      <port number="4569" />
      <certificate subject="/DC=org/DC=doegrids/OU=Services/CN=sam/samgfarm4.fnal.gov"/>
    </head_server>
  </host>
</jim_gridftp_configuration>
Configuration Parameters:
port: port on which the data/head servers are running.
certificate, subject: DN of the service.
Notes: This product is used to start a gridftp server at the gateway node, for gateway/worker node transfers, and a server where the SAM station is, for data transfer. |
|
Product |
sam_fcp |
|
Install as |
Sam |
|
Install Operation |
upd install sam_fcp -G-c |
|
Tailor as |
Sam |
|
Tailor Operation |
ups tailor sam_fcp
Configuration example:
<sam_fcp_configuration>
  <interview_schema version="1_0" />
  <fcp_queue name="default">
    <fcp_port port="7788" />
    <max_xfers transfers="15" />
    <transfer_mechanism name="jim_gridftp" />
    <time_out value="3600" />
  </fcp_queue>
  <fcp_queue name="default1">
    <fcp_port port="7789" />
    <max_xfers transfers="15" />
    <transfer_mechanism name="jim_gridftp" />
    <time_out value="3600" />
  </fcp_queue>
</sam_fcp_configuration>
Configuration Parameters:
port: port on which the data/head servers are running.
certificate, subject: DN of the service.
Notes: This product is used to control the number of concurrent transfers from the head node to the worker node. |
|
Product |
jim_config |
|
Configure as |
Sam |
|
Configure operation |
Describe the resources at your site to the JIM software suite by answering the questions prompted by
$ ups configure_complex_site jim_config
or
$ ups configure_site jim_config
(for sites with 1 cluster, 1 gatekeeper, 1 jobmanager, 1 station).
Notes: The resources at a site are organized in the following hierarchy:
· a site can have multiple clusters
· a cluster can have multiple "gatekeepers" (grid gateways)
· a gatekeeper can have multiple "jobmanagers" (grid interfaces to local resource manager interfaces)
· a "jobmanager" can submit jobs to multiple SAM stations (typically in different universes, i.e. production or development)
· a cluster can also have multiple SAM stations not accessible via the grid (SAM stations that are not "under" any gatekeeper and are used for local submission)
· a cluster can have a durable storage, to keep intermediate processing files. This is configured by setting the local_storage tag in the configuration.
· To configure a complex site with application specific queues, please refer to the appropriate documentation on the Samgrid homepage.
Example in xml of a site description:
<?xml version="1.0"?>
<site_configuration>
  <site name="FNAL" />
  <schema version="v1_0" />
  <cluster name="SamGrid-testbed" architecture="Linux+2.4">
    <gatekeeper location="samadams.fnal.gov:2119">
      <jobmanager name="jobmanager-sam">
        <station name="samadams" universe="dev" experiment="d0" />
        <station name="samadams" universe="prd" experiment="d0" />
      </jobmanager>
    </gatekeeper>
  </cluster>
  <local_storage path="/data/sam/disk/durable_location" node="samgfarm4.fnal.gov" />
</site_configuration> |
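To make the site / cluster / gatekeeper / jobmanager / station hierarchy concrete, the sketch below walks the example site description with Python's standard XML parser and prints one indented line per resource. It is an illustration only and not part of jim_config; the function name describe_site is hypothetical, and the element and attribute names are taken from the example above.

```python
import xml.etree.ElementTree as ET

# The example site description from this section.
SITE_XML = """<?xml version="1.0"?>
<site_configuration>
  <site name="FNAL" />
  <schema version="v1_0" />
  <cluster name="SamGrid-testbed" architecture="Linux+2.4">
    <gatekeeper location="samadams.fnal.gov:2119">
      <jobmanager name="jobmanager-sam">
        <station name="samadams" universe="dev" experiment="d0" />
        <station name="samadams" universe="prd" experiment="d0" />
      </jobmanager>
    </gatekeeper>
  </cluster>
  <local_storage path="/data/sam/disk/durable_location" node="samgfarm4.fnal.gov" />
</site_configuration>"""


def describe_site(xml_text):
    """Return one indented text line per resource in the site description."""
    root = ET.fromstring(xml_text)
    lines = ["site " + root.find("site").get("name")]
    for cluster in root.findall("cluster"):
        lines.append("  cluster " + cluster.get("name"))
        for gatekeeper in cluster.findall("gatekeeper"):
            lines.append("    gatekeeper " + gatekeeper.get("location"))
            for jobmanager in gatekeeper.findall("jobmanager"):
                lines.append("      jobmanager " + jobmanager.get("name"))
                for station in jobmanager.findall("station"):
                    lines.append("        station %s (universe=%s, experiment=%s)" % (
                        station.get("name"), station.get("universe"),
                        station.get("experiment")))
    for storage in root.findall("local_storage"):
        lines.append("  local_storage %s:%s" % (storage.get("node"), storage.get("path")))
    return lines


for line in describe_site(SITE_XML):
    print(line)
```

Stations that sit directly under a cluster (not behind a gatekeeper) are not handled by this sketch, since the example above does not show how they are written.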
|
Product |
jim_advertise |
|
Install as |
Sam |
|
Install operation |
upd install jim_advertise -G-c |
|
Tailor as |
Sam |
|
Tailor Operation |
ups tailor jim_advertise
Configuration example:
<jim_advertise_configuration>
  <interview_schema version="1_3" />
  <verbose value="true" />
  <log_file path="/data/jim/jim_advertise/log" />
  <advertise_interval interval="180" />
  <collector fqdn="samgrid.fnal.gov" />
  <extra_condor_advertise_args arguments="-debug" />
  <classad_generator xquery="${JIM_ADVERTISE_DIR}/bin/xml2classad_cgs.xq" />
  <post_filter exe="${JIM_ADVERTISE_DIR}/bin/postFilter_cgs.sh" />
  <condor_config_parameters>
    <network_interface ip="131.225.167.1" />
  </condor_config_parameters>
</jim_advertise_configuration>
Configuration Parameters:
verbose, value: Do you want JIM advertise to log debug messages while executing? (The log file becomes very big.)
log_file, path: Enter the full path of your log file (missing directories will be created automatically when you first start jim_advertise).
advertise_interval: Enter the interval in seconds for sending the classads to the collector.
collector, fqdn: Enter the Fully Qualified Domain Name of the collector. Use the default.
extra_condor_advertise_args: Enter any extra condor arguments you want to send to the collector (e.g. -tcp -debug).
classad_generator: This xquery decides how to publish resources in the form of classads from the site configuration XML file. Use the defaults.
post_filter: Enter the full path to the post filtering script if you have one. This script takes the output of the classad generation procedure and should produce output in the same form as its input. The use of environment variables is allowed. Use the defaults.
condor_config_parameters, network_interface: Enter the IP address of the system that you want to use. If you have multiple network interfaces, use the IP address of the interface that is accessible from outside your local network. To get information on the interfaces on your system, run /sbin/ifconfig in another window. |
|
Run as |
Sam |
|
Run Operation |
ups run jim_advertise &
Notes: Refer to the section on starting up the servers for instructions to run all servers. |
End of Execution site Installation.
Skip to starting up the servers.
The SAM-Grid monitoring service is available on the web at http://samgrid.fnal.gov:8080
In order for a site to be monitored, there are 4 steps to follow:
1. Install Globus MDS on at least one machine of the site.
2. Create the site configuration.
3. Configure/update MDS with the SAM-Grid schema/information hierarchy.
4. Inform the SAM-Grid team of the availability of the new monitoring site with the following details: the host and port where MDS is running, and the Jim-Site name that was chosen by the site administrator while tailoring jim_info_providers.
You may skip this sub-section if you have already configured the site information for the advertisement framework. If you have not done so yet, follow Creating Site Configuration now.
|
Product |
jim_info_providers |
|
Install as |
Sam |
|
Install operation |
$ upd install jim_info_providers -q GCC-2.95.2 -G-c |
|
Tailor as |
Sam |
|
Tailor operation |
$ ups tailor jim_info_providers -q GCC-2.95.2 |
|
Run as |
Sam |
|
Run operation |
$ ups start jim_info_providers -q GCC-2.95.2
Notes: Ignoring warning messages at startup time is generally ok. Refer to the section on starting up the servers for instructions to run all servers. |
End of Monitoring site Installation.
You may start the servers now.
After you've installed all the components of SAM-Grid, i.e. JIM and/or SAM, install the package that runs the servers.
|
Product |
server_run |
|
Install as |
Sam |
|
Install operation |
upd install -G-c server_run |
|
Tailor as |
Sam |
|
Tailor operation |
ups tailor server_run |
|
Configure as |
Root |
|
|
Edit the appropriate system bootstrap files in /etc/rc.XXX (/etc/rc.d/rc.local is a good choice) so that the following is executed at system boot time:
$ su sam -c /full/path/samgrid_startup.sh
where the file samgrid_startup.sh contains something like:
#!/bin/sh
. $SETUPS_DIR/setups.sh
ups run server_run
If your new installation of server_run also includes the SAM server suite, you might also want to disable automatic start-up of SAM servers by the lower-level sam_bootstrap package.
Strongly recommended: if you can, reboot your machine at this time to check that server_run starts properly at system boot. We don't ask you to do this out of affection for the popular PC operating system! ;-) If a reboot is impossible, start the servers now (see below). |
|
Run as |
Sam |
|
|
If you were running the XML DB servers during this installation (recommended), remember to stop them at this time:
$ ups stop xmldb_server
You can now run SAM-Grid (assuming you have all the certificates in place):
On a new terminal window (for a clean environment) type $ ups run server_run |
Congratulations. Your installation is complete.
|
Product |
samgrid_util, plus the product whose configuration you want to modify |
|
Execute as |
Sam. |
|
Execute operation |
$ setup samgrid_util
$ setup <product name>
$ jim_configure.sh <product name>
Examples:
1. Modifying sam_fcp configuration:
2. Modifying site_configuration: |
Newer versions of the samgrid_util package (v3_1_8+) have useful scripts in the cron directory of the product. These scripts can be installed in crontab to automate some of the regular maintenance tasks. This section describes the available tools and how to use them.
|
Product |
samgrid_util |
|
Execute as |
samgrid on the head node or forwarding node once a day |
|
Execute operation |
$ setup samgrid_util
$ $SAMGRID_UTIL_DIR/cron/samgrid_ce_disk_cleanup.sh \
--gramDir=<Dir containing gram_job_mgr_*log and gram_scratch_*> \
--gassDir=<Globus gass cache dir> --sandboxDir=<jim_sandbox dir>
Clean-up policy:
· gram_scratch_*: 10 days old
· gram_job_mgr_*log: 3 days old
· jim_sandbox dirs: 30 days old |
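The clean-up policy above amounts to an age threshold per item type. The sketch below only illustrates that rule and is not the logic of samgrid_ce_disk_cleanup.sh itself: the dictionary keys and the function name are hypothetical, and treating an item as expired only when it is strictly older than the threshold is an assumption.

```python
import time

# Retention periods in days, taken from the clean-up policy above.
RETENTION_DAYS = {
    "gram_scratch": 10,
    "gram_job_mgr_log": 3,
    "jim_sandbox": 30,
}

SECONDS_PER_DAY = 86400


def is_expired(kind, mtime, now=None):
    """Return True if an item of this kind, last modified at mtime, is past retention.

    mtime and now are seconds since the epoch; now defaults to the current time.
    """
    if now is None:
        now = time.time()
    return (now - mtime) > RETENTION_DAYS[kind] * SECONDS_PER_DAY


# A job manager log last touched 4 days ago is past its 3-day retention,
# while a scratch directory of the same age is kept.
now = 1000000000
four_days_ago = now - 4 * SECONDS_PER_DAY
print(is_expired("gram_job_mgr_log", four_days_ago, now))
print(is_expired("gram_scratch", four_days_ago, now))
```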
|
Product |
samgrid_util |
|
Execute as |
samgrid on the forwarding node once a day |
|
Execute operation |
$ setup samgrid_util
$ $SAMGRID_UTIL_DIR/cron/mark_osg_jobs_for_deletion.sh --hold-days=<Number of days. Jobs older than this number will be marked for deletion. Min value 5>
This script only works on the newer forwarding nodes (samgfwd0x) and never cleans up jobs that are less than 5 days old.
Note: If the forwarding node has been in production for a while without this script in place, run the script manually the first few times before putting it in crontab. After a few days of production, the number of held jobs can quickly grow beyond 100,000. Cleaning up so many jobs is an intensive process and is best handled incrementally; otherwise CondorG on the forwarding node becomes unresponsive. It is highly recommended to start with a very large value for --hold-days and gradually reduce it by a week (7 days) or 10 days, depending on how many jobs are in the held state. It is best to limit the number of jobs cleaned at a time to 10,000 and adjust the number of days accordingly. To see the jobs that are in the held state:
$ setup samgrid_osg_client
$ condor_q -constraint "JobStatus == 5" |
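The incremental clean-up recommended in the note can be planned ahead as a list of decreasing --hold-days values. The following is a small sketch under stated assumptions: the helper name is hypothetical, the starting value of 60 days is only an example, and the sketch does not look at the actual number of held jobs, which should still be checked with condor_q between runs.

```python
def hold_days_schedule(start_days, step=7, minimum=5):
    """Return decreasing --hold-days values from start_days down to the minimum.

    step is the reduction between manual runs (7 or 10 days in the note above);
    minimum is the smallest value the script accepts.
    """
    if start_days < minimum:
        raise ValueError("start_days must be at least %d" % minimum)
    values = []
    days = start_days
    while days > minimum:
        values.append(days)
        days -= step
    values.append(minimum)  # final run at the smallest allowed value
    return values


# Example: a node that accumulated held jobs for roughly two months.
for days in hold_days_schedule(60):
    print("$SAMGRID_UTIL_DIR/cron/mark_osg_jobs_for_deletion.sh --hold-days=%d" % days)
```

Each printed line is one manual run; once the final value has been reached without problems, the script can be moved into crontab.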
|
Product |
samgrid_util |
|
Execute as |
sam on the samgrid.fnal.gov node once a week |
|
Execute operation |
$ setup samgrid_util
$ $SAMGRID_UTIL_DIR/cron/mark_samgrid_jobs_for_deletion.sh --max-days=<Number of days. Jobs older than this number will be marked for deletion. Min value 60> --debug=<true|false Case sensitive. Behavior defaults to true>
Note: If the queuing node has been in production for more than 7 months without this script in place, run the script manually the first few times before putting it in crontab. After a few days of production, the number of jobs can quickly increase. Cleaning up so many jobs is an intensive process and is best handled incrementally; otherwise CondorG on the queuing node becomes unresponsive. It is highly recommended to start with a very large value for --max-days and gradually reduce it by a week (7 days). |
|
Product |
samgrid_util |
|
Execute as |
sam on the forwarding node once a day at 00:05 am |
|
Execute operation |
$ setup samgrid_util; setup vdt; setup jim_advertise; setup tomcat; $SAMGRID_UTIL_DIR/cron/samgrid_rotate_logs.sh --logrotate-workdir=/samgrid/logs/jimlogs/samgrid_log_rotate --globus-log-dir=$GLOBUS_LOCATION/var --jimadvertise-log-dir=/samgrid/logs/jimlogs/jim_advertise --tomcat-log-dir=$TOMCAT_DIR/logs
It is strongly recommended to run this script at 00:05 am via cron. This script rotates the samgrid jobmanager logs, the globus gatekeeper and accounting logs, the globus gridftp logs and the jim_advertise logs, and stores the old log files under a specific naming format. This naming format is required by the log archiving tool to archive old logs. |
|
Product |
samgrid_util |
|
Execute as |
sam on the forwarding node once every month |
|
Execute operation |
$ setup tomcat; setup vdt; setup jim_advertise; $SAMGRID_UTIL_DIR/cron/archive_samgrid_logs.sh --archive-dir=/samgrid/logs/logsarchives --tomcat-log-dir=$TOMCAT_DIR/logs --globus-log-dir=$GLOBUS_LOCATION/var --jimadvertise-log-dir=/samgrid/logs/jimlogs/jim_advertise
This tool archives logs that are older than the current month, zips them and stores them in the directory specified by --archive-dir. It expects the names of the old log files to follow the convention produced by running samgrid_rotate_logs.sh above. |
|
Product |
samgrid_util |
|
Execute as |
sam based on the needs |
|
Execute operation |
$ setup samgrid_util
$ $SAMGRID_UTIL_DIR/cron/archive-jim_broker_client-spool.sh --number-of-jobs=250
In this case the oldest 250 job spool areas are moved from the current location /samgrid/logs/jimlogs/jim_broker_client/spool to /samgrid/logs/jimlogs/jim_broker_client/spool/archive/spool.1/
To avoid human error, the maximum number of jobs is capped at 250. If you need to move more directories at a time, you can run the tool again. A better solution would be to have two queuing nodes, one for Monte Carlo and the other for Reconstruction, to make the infrastructure more scalable. |
|
Product |
sam_gsi_config, jim_broker_client |
|
Execute as |
sam on the queuing node one to two times a day |
|
Execute operation |
$ setup sam_gsi_config -q vdt; setup jim_broker_client; generate_submission_site_gridmap
Requires sam_gsi_config v2_3_5 -q vdt or higher |
|
Product |
Vdt, sam_gsi_config |
|
Execute as |
Root for vdt related commands. Affects all vdt installations |
|
Execute operation |
Root related commands:
root$ setup vdt
root$ vdt-control --on vdt-update-certs
root$ vdt-control --on fetch-crl
sam_gsi_config related commands: add a crontab entry to run the following commands daily as user sam:
setup sam_gsi_config -q vdt; sam_gsi_install_ca --force-copy
Requires sam_gsi_config v2_3_5 -q vdt or higher |
|
Product |
jim_client |
|
Execute as |
Anyone with proper credentials. |
|
Execute operation |
samg submit <myjob.jdf> [condor_submit args]
Notes: [condor_submit args] are the arguments passed directly to the condor_submit command, e.g. “-r schedd.machine.fqdn”. |
The steps to submit a job are:
1. The user creates a job description file (MyJob.jdf) in his/her writable working directory.
A typical SAM analysis job is given below:
------------------------------ ------------------------------
sam_dataset = ab2files
station_name = sammy
executable = /home/murthi/testbed/samanalysis/retrieve.sh
job_manager = sam
job_type = sam_analysis
sam_universe = dev
sam_experiment = d0
output = /tmp/murthi/hello_sammy.output
error = /tmp/murthi/hello_sammy.error
cpu-per-event = 1s
group = grid
instances = 1
Globusscheduler = $$(gatekeeper_url_)
------------------------------ ------------------------------
You can download an example of a test script from here.
2. By entering 'samg submit MyJob.jdf' the job is submitted to the SAM-Grid for execution.
|
$samg submit samanalysis.samgjdf_sammy
Checking Grid credentials... Ok.
Job(s) submitted successfully.
Global JID = murthi_samadams.fnal.gov_130922_17513 |
You will get the Global Job Id which can be used for reference when monitoring the job. For each instance of a job, output and error files are generated. In addition to this you can specify a log file that will keep track of the job during submission. These files are important especially for troubleshooting. Optionally, you can also submit your job to a specific scheduler listed on your collector by invoking the command below.
samg submit <samg_jdf> -f schedd.machine.fqdn
3. You may optionally check the job's status in the queue with
$ samg list jobs murthi_samadams.fnal.gov_130922_17513_0
The job can be referenced from the monitoring site by its Global Job ID.
In order to submit a job to SAM-Grid, you need to create a job description file (jdf). The jdf can contain a number of attributes, some of which are required.
The jdf syntax is case-sensitive. The order of attributes does not matter. However, when running job instances, the "instances" attribute should be located after its attributes have been defined. In case of multiple instances, only the attributes that change should be written again in the jdf. An example is shown above.
Attributes are grouped according to the way 'samg submit' handles them. It is possible to have different types of jobs. "sam_analysis" jobs are brokered to a SAM-Grid resource. The job type "caf" uses caf resources. There is also some ongoing work to support "monte_carlo" jobs.
The required attributes for sam_analysis jobs are "sam_dataset", "sam_universe", "sam_experiment", "executable", "cpu-per-event", and "instances".
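A job description file can be checked for the required sam_analysis attributes before it is passed to 'samg submit'. The sketch below is not part of jim_client: the function names are hypothetical, and it assumes that every attribute is written on its own line as "name = value", as in the example job above. Attribute names are compared case-sensitively, matching the syntax rule stated above.

```python
# Required attributes for sam_analysis jobs, as listed in the text above.
REQUIRED_SAM_ANALYSIS = (
    "sam_dataset", "sam_universe", "sam_experiment",
    "executable", "cpu-per-event", "instances",
)


def parse_jdf(text):
    """Parse "name = value" lines into a dict; lines without "=" are ignored."""
    attributes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        attributes[name.strip()] = value.strip()  # a later definition replaces an earlier one
    return attributes


def missing_attributes(text, required=REQUIRED_SAM_ANALYSIS):
    """Return the required attribute names that the jdf text does not define."""
    attributes = parse_jdf(text)
    return [name for name in required if name not in attributes]


# A shortened version of the example job, with cpu-per-event left out on purpose.
example_jdf = """\
sam_dataset = ab2files
executable = /home/murthi/testbed/samanalysis/retrieve.sh
job_type = sam_analysis
sam_universe = dev
sam_experiment = d0
instances = 1
"""
print(missing_attributes(example_jdf))
```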
More information on troubleshooting can be found at http://www-d0.fnal.gov/computing/grid/JIM-FAQ.htm.
These are the specifications for the SAM-Grid job description language.
The job description language distinguishes the different job types (sam_analysis, vanilla, caf, mc_runjob, etc). In the specifications presented below, a section for each type is provided. In addition some extensions that are not yet fully productized are listed. These extensions are therefore not recommended for usage.
Note for advanced users
“samg submit” converts JDL of a supported job type to Condor JDL with some exceptions.
· Attributes prefixed with “+” override the automatic generation of these attributes by “samg submit”. They are printed in the classads of the generated Condor JDL without the “+” prefix.
· Attributes prefixed with “++” do not override the automatic generation of attributes; instead they are printed in the generated Condor JDL with a single “+” prefix.
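The two prefix rules can be summarized as a small mapping from the name written in the samg JDL to the name that appears in the generated Condor JDL. This is only an illustration of the rules as stated above, not the code used by “samg submit”; the function name and the attribute MyAttr are hypothetical, and the handling of unprefixed names (returned unchanged) is an assumption.

```python
def condor_attribute_name(jdl_name):
    """Return (name in the generated Condor JDL, overrides auto-generation?)."""
    if jdl_name.startswith("++"):
        # "++" prefix: no override, printed with a single "+".
        return "+" + jdl_name[2:], False
    if jdl_name.startswith("+"):
        # "+" prefix: overrides the auto-generated attribute, printed without "+".
        return jdl_name[1:], True
    # Unprefixed names are outside the two rules above; pass them through.
    return jdl_name, False


for name in ("+MyAttr", "++MyAttr"):
    print(name, "->", condor_attribute_name(name))
```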
The specifications listed below apply for all types of jobs.
job_type = <keyword>
job_type refers to a unique keyword that denotes a specific job type. Valid keywords are
montecarlo, mc_runjob (deprecated), merge, structured, samanalysis and caf.
instances = 1
Currently multiple instances of jobs are not supported.
Some of the attributes listed below may be required for some job types, or may require a default value if declared. Please refer to the job specific JDL specification, which has the final say on how an attribute should be used.
input_sandbox = <directory>
Specifies a directory that serves as the input sandbox. This directory will be shipped to the execution site. The sandbox must contain the executable as well as other files needed for the job.
input_sandbox_tgz = <pathname to a tar.gz file>
Specifies a “tar.gz” file that serves as the input sandbox. This bundle (input_sandbox_tgz) will be shipped to the execution site. The sandbox must contain the executable as well as the other files needed for the job.
log = <pathname>
Log for grid specific information esp. useful for debugging. Note: This is not the user job's log.
input = <pathname>
Any standard input the job needs while running. The pathname refers to a local file name.
output = <pathname>
error = <pathname>
These refer to the local files to which the job’s standard output and error will be shipped back from the execution site. This feature is not completely functional: the output and error do not reach the client side, but they can be extracted from the submission site.
jobmanager_name = <keyword>
jobmanager_name refers to the job manager used at the execution site, e.g. sam
Globusscheduler = <scheduler-name>
Specifies the Globus resource to which the job should be submitted. The default is the matched resource.
station_name = <stationname>
The station name at which the job will be executed, assuming that the requirements are satisfied. If user does not define the station name, brokering will determine it from the matching station. However, station name may be declared if user prefers a certain station.
requirements = <Boolean expression>
The expression must evaluate to true on the matching machine. The requirements specified by the user get appended to the default requirements generated by the jim_client.
arguments = <executable_args>
Parameters to be passed to the executable. The parameters must NOT be enclosed into double quotes (e.g. arguments = arg1 arg2 arg3)
grid_resource_requirements_string = <Resource Contact | Constraints expressed in GlueSchemaFormat>
Example:
Submitting job to a specific resource:
- grid_resource_requirement_string = cmsosgce.fnal.gov:2119/jobmanager-condor
- grid_resource_requirement_string = (TARGET.GlueCEInfoHostName =?= "stitch.oscer.ou.edu")
Submitting job using OSG ReSS to a resource matching constraints:
- grid_resource_requirement_string = (stringlistimember("VO:dzero", TARGET.GlueCEAccessControlBaseRule, ",") && stringlistimember("OSG-0.4.1", TARGET.GlueHostApplicationSoftwareRunTimeEnvironment, ","))
- grid_resource_requirement_string = (GlueCEInfoContactString == "red.unl.edu:2119/jobmanager-pbs") || (GlueCEInfoContactString == "cmsosgce.fnal.gov:2119/jobmanager-condor") || (GlueCEInfoContactString == "grid1.oscer.ou.edu:2119/jobmanager-lsf") || (GlueCEInfoContactString == "osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor")
An example can be found at <jim_client_product_dir>/demo_examples/release/samanalysis.samgjdf.
cpu-per-event = <value>
The estimated CPU time used per event, followed by a unit suffix: s, m or h (seconds, minutes or hours).
sam_dataset = <definition_name>
Name of the dataset definition to be used in the job. The dataset definition must be predefined.
sam_universe = <dev | prd>
Specifies the universe for the job. This is required to match with the resource.
sam_experiment = <D0 | CDF>
Specifies the experiment for the job. This is also required to match with the resource.
group = <groupname>
The group name to which the job belongs.
extra_sam_submit_args = <sam submit arguments>
These are arguments of form (--name1=value1 --name2=value2). These arguments are appended to the generated default arguments for “sam submit”. Quotes are not generally recommended.
An example can be found at <jim_client_product_dir>/demo_examples/release/caf.samgjdf
input_sandbox_tgz = <tgz_file>
The user directory, already targz'ed. Note that the '_tgz' in the name is needed to distinguish it from the sam_analysis-type attribute 'input_sandbox', which is a local dir to be tgz'ed.
caf_initial_section = <integer value>
The number of the initial section.
sam_dataset = <definition_name>
Name of the dataset definition to be used in the job.
caf_user_name = <user_name>
The default is <user_name> on the client host.
email = <email_address>
The user email address. The default is <user_name>@<hostname> on the client host.
output_sandbox = <output_location>
The location to which the output of the job is sent. The default is <user_name>@<hostname>:~<user_name>/<sam_gid>.tgz (where sam_gid is the global job id assigned by the SAM-Grid to the job).
caf_job_type = <caf_job_type>
The default is sam.
caf_final_section = <integer value>
The number of the final section. The default is the initial_section, i.e. 1 section only.
sam_universe = <dev | prd>
Specifies the universe for the job. This is required to match with the resource. The default is prd.
sam_experiment = <d0 | cdf>
Specifies the experiment that the job is dedicated to. This is also required to match with the resource. The default is cdf.
extra_caf_submit_args = <caf submit arguments>
These are arguments of form (-name1=value1 -name2=value2). These arguments are appended to the generated default arguments for “sam submit”. Quotes are not generally recommended.
NOTE: To store the files generated in the Monte Carlo run back to Fermilab, you need to be a member of the mc99 data group. You can verify this information at http://d0db.fnal.gov/sam_admin/cgi/autoRegister.py
runjob_requestid = <monte carlo request number>
The request number which has its details present in the request database. For more information please see http://www-d0.fnal.gov/computing/mcprod/mcc.html
runjob_numevts = <Number of events to produce for the Request Id>
The number of events to be produced for the Request Id (runjob_requestid).
d0_release_version = <d0 code version>
The version of d0 code that is to be used for producing events for runjob_requestid. The d0 code version should be consistent with the version specified in the jobfiles_dataset (explained below).
jobfiles_dataset = <dataset (snapshot) containing the tar balls>
The jobfiles_dataset is the dataset (snapshot) containing the files that are necessary for executing the request. This dataset typically contains but is not limited to, d0 code tree (e.g. d0_p14.03.02.tar.gz), Magnetic Field files (e.g. MagField_v00-01-00.tar.gz) if required, card files (e.g. cardFile_v00-07-00.tar.gz) & mc_runjob code tree (e.g. mc_runjob_v06-02-02-jim-04.tar.gz).
phase_dataset_intervals = <comma separated list of event intervals>
The phase_dataset_intervals attribute specifies the intervals of events you want to process or recover.
Example: phase_dataset_intervals = 1-250,501-1000,1251-2000
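The interval list in the example can be read as first-last pairs. The sketch below parses such a value and counts the events it covers; it is an illustration only, the function names are hypothetical, and treating both ends of each interval as inclusive is an assumption.

```python
def parse_intervals(spec):
    """Parse a value such as "1-250,501-1000" into a list of (first, last) tuples."""
    intervals = []
    for part in spec.split(","):
        first_text, last_text = part.strip().split("-")
        first, last = int(first_text), int(last_text)
        if first > last:
            raise ValueError("interval %r runs backwards" % part)
        intervals.append((first, last))
    return intervals


def total_events(intervals):
    """Count events covered by the intervals, assuming inclusive bounds."""
    return sum(last - first + 1 for first, last in intervals)


intervals = parse_intervals("1-250,501-1000,1251-2000")
print(intervals)
print(total_events(intervals))
```

With inclusive bounds, the example value covers 250 + 500 + 750 events.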
minbias_dataset = <dataset containing minimum bias events to be overlaid>
The files containing minimum bias events that are to be overlaid in the digitization phase are specified in this dataset.
phase_dataset = <dataset containing the input for a phase in the Monte Carlo chain>
If the request takes the input for a particular phase (typically it’s the generation phase) from SAM, then the dataset containing the input is specified through this attribute. During submission consistency checks are made to determine if the dataset specified by the phase_dataset attribute matches the dataset specified in the request details.
phase_skip_num_events = <number of events to skip from the input of the phase_dataset>
This directive configures mc_runjob to skip the number of events specified before reading the input to the phase. This option is useful in particular for error recovery. If some jobs fail, this option allows the user to run the jobs again, reading their expected input event range. The event range of a job that failed can be computed as <job submission index> * <events_per_file> + < phase_skip_num_events> (the last term is usually 0 for the first submission).
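The recovery formula above is a single multiplication and addition. The sketch below just spells it out; the function and parameter names are hypothetical, and whether the job submission index starts at 0 or 1 is not stated in this manual, so check it against your own submission before relying on the result.

```python
def events_to_skip(job_submission_index, events_per_file, previous_skip=0):
    """Apply <job submission index> * <events_per_file> + <phase_skip_num_events>.

    previous_skip is the phase_skip_num_events value used in the failed
    submission (usually 0 for the first submission).
    """
    return job_submission_index * events_per_file + previous_skip


# Job with submission index 3 from a first submission with 250 events per file.
print(events_to_skip(3, 250))
```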
check_consistency = <Boolean value>
This attribute controls the level of consistency checks that are made during the grid job submission. The default is true (all checks are made). A value of false causes some checks (e.g. the d0 code version check) to be skipped. Mandatory checks (e.g. if input is from SAM) are still done.
events_per_file = <number of events per output file>
This attribute states the number of events that are to be produced per output file (or phase). For example, with events_per_file=250 a grid job of 25,000 events will generate 100 files (for each Monte Carlo phase) containing 250 events each. If unspecified, the number of events per output file depends on the execution site at which the grid job executes.
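The file count in the example follows directly from dividing the job size by events_per_file. A short sketch of that arithmetic is given below; the function name is hypothetical, and rounding up when the job size is not an exact multiple of events_per_file is an assumption rather than documented behavior.

```python
def output_files_per_phase(total_events, events_per_file):
    """Number of output files per Monte Carlo phase, rounding up a partial last file."""
    if events_per_file <= 0:
        raise ValueError("events_per_file must be positive")
    return (total_events + events_per_file - 1) // events_per_file


# The example from the text: 25,000 events at 250 events per file.
print(output_files_per_phase(25000, 250))
```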
d0_release_version = <d0 code version>
The version of d0 code that is to be used for merging files (typically thumbnails). The d0 code version should be consistent with the version specified in the jobfiles_dataset (explained below).
jobfiles_dataset = <dataset (snapshot) containing the tar balls>
The jobfiles_dataset is the dataset (snapshot) containing the files that are necessary for executing the merging job. This dataset typically contains, but is not limited to, the d0 code tree (e.g. d0_p14.03.02.tar.gz) and the mc_runjob code tree (e.g. mc_runjob_v06-02-02-jim-04.tar.gz). Other optional files, such as card files and magnetic field files, are not required to execute merging jobs. If they are present in the dataset, they will not affect the outcome of merging jobs.
The following attributes are mutually exclusive, so exactly one of them has to be present to submit a merge job.
merge_dataset_name=<dataset (snapshot) containing the files to be merged>
The dataset contains the files to be merged (typically thumbnails). This attribute is mutually exclusive with merge_dimension_query.
merge_dimension_query=<dimension query specifying the constraints for identifying files to be merged>
This is the standard dimension query accepted by SAM. Do not enclose the query in double quotes.
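The rule that exactly one of merge_dataset_name and merge_dimension_query must be present can be checked before submission. The following Python sketch is illustrative only and is not the check performed by SAM-Grid itself: the function name check_merge_input is hypothetical, the JDL is assumed to have already been read into a plain dictionary of attribute names and values, and the dataset name used below is made up.

```python
MERGE_INPUT_ATTRIBUTES = ("merge_dataset_name", "merge_dimension_query")


def check_merge_input(jdl):
    """Return the name of the single merge input attribute that is set.

    jdl is ASSUMED to be a dict mapping JDL attribute names to string
    values. Raises ValueError if neither or both attributes are set.
    """
    present = [name for name in MERGE_INPUT_ATTRIBUTES if jdl.get(name)]
    if len(present) != 1:
        raise ValueError(
            "exactly one of %s must be specified, found: %s"
            % (", ".join(MERGE_INPUT_ATTRIBUTES), present or "none")
        )
    return present[0]


# Hypothetical merge job description with a made-up dataset name.
print(check_merge_input({"merge_dataset_name": "my_thumbnail_snapshot"}))
```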
check_consistency = <Boolean value>
This attribute controls the level of consistency checks that are made during the grid job submission. The default value is true (all checks are made). A value of false causes some checks (e.g. the d0 code version check) to be skipped. Mandatory checks (e.g. if input is from SAM) are still performed.
job_structure=<job type 1, job type 2, …, job type n>
This attribute specifies which valid SAM-Grid job types are to be executed and in what order. For example, you can combine a montecarlo job with a merge job as follows:
job_structure= montecarlo, merge
Please note that a montecarlo type job is the parent of a merge type job; that is, after a montecarlo job has executed, the merge type job operates only on the results of that particular grid job.
Please send your suggestions or comments to Parag Mhashilkar and Gabriele Garzoglio.
Last updated on Wednesday, October 14, 2009.