SAM-Grid User and Administrator Manual

Contents

1        How to read this manual

2        Introduction

2.1         Overview of the SAM-Grid architecture

3        Installation of the SAM-Grid

3.1         System requirements

3.1.1     Hardware

3.1.2     Software

3.1.3     System Configuration

3.1.4     Summary of the activities as root

3.1.4.1       Setup Group Accounts

3.1.4.2       Open Ports for Incoming TCP connections

3.1.4.3       Enable Automatic Restart of SAMGrid servers at Boot Time

3.1.4.4       Setup the /etc/grid-security and xinetd daemon (Execution Site Only)

3.1.5     Packages and Samgrid Production Release Cuts

3.2         Middleware Installation

3.2.1     Installing and Configuring Condor and Globus

3.2.2     Installing and Configuring the Grid Security Infrastructure

3.2.3     Updating the Grid Security Infrastructure

3.2.4     Get a Service Certificate

3.2.5     Installing XMLDB

3.2.6     Store the SAM Grid Global Constants to the XML database

3.3         Client Site Installation

3.4         Submission Site Installation

3.4.1     General configuration

3.4.2     Installation of the JIM Broker client

3.4.3     Installing Output retrieval via web

3.5         Execution Site Installation

3.5.1     General configuration

3.5.2     Install sam

3.5.3     Setting up durable location (Optional)

3.5.4     Get a host certificate

3.5.5     Get the list of users authorized to use the resources (gridmap-file)

3.5.6     Install SAM-Grid Globus job-managers and sandboxing mechanisms

3.5.7     Creating the Resource Description

3.5.8     Installing the resource advertisement software

3.6         Monitoring Site Installation

3.6.1     Create site Configuration

3.6.2     Configure/Update MDS

4        Starting the Servers

5        Modifying the Product Configuration

6        Automating the Maintenance Tasks

6.1         Regular Cleanup and Maintenance Tools

6.1.1     Cleaning up old Globus files and jim sandboxes

6.1.2     Cleaning up the CondorG queue for OSG jobs

6.1.3     Cleaning up the CondorG queue for Samgrid jobs

6.1.4     Rotate log files daily and archive them monthly

6.1.5     Relocate condor job spool directories for jim_broker_client

6.2         Automate security setup tasks

6.2.1     Generate gridmapfile for jim_broker_client from the DZero member list in voms

6.2.2     Automatically fetch the latest CA certificate files and update samgrid CA files

7        Quick-Start

7.1         Job Submission

7.1.1     A typical SAM Analysis Job submission

7.1.2     Job Description File

7.1.2.1       Attributes

8        FAQ

9        Appendix A: The SAMGrid JDL

9.1.1     Common JDL Specifications

9.1.1.1       Required attributes

9.1.1.2       Optional attributes

9.1.2     SAM Analysis JDL Specifications

9.1.2.1       Required attributes

9.1.2.2       Optional attributes

9.1.3     CAF JDL Specifications

9.1.3.1       Required Attributes

9.1.3.2       Optional attributes

9.1.4     Monte Carlo JDL Specifications

9.1.4.1       Required Attributes

9.1.4.2       Optional attributes

9.1.5     Merge Job JDL Specifications

9.1.5.1       Required Attributes

9.1.5.2       Mutually Exclusive attributes

9.1.5.3       Optional attributes

9.1.6     Structured Job JDL Specifications

9.1.6.1       Required Attributes

10     Suggestions

1            How to read this manual

This manual is meant to be read sequentially. As you read, you will encounter pointers that direct you to a site-specific installation, e.g. "skip to Submission Site Installation". If a pointer matches your desired installation, follow it, then continue reading sequentially until the manual marks the end of that site-specific installation.

2            Introduction

SAM-Grid is a virtual project whose core is the D0-PPDG group at Fermilab and which includes off-site D0 collaborators under the aegis of various Grid projects. Its mission is to enable fully distributed computing for D0 and CDF, by:

·          Enhancing SAM as the distributed data handling system of the experiments.

·          Incorporating standard Grid tools and protocols.

·          Developing new solutions for Grid computing together with Computer Scientists.

Under this mission, the project strives to unite the D0 efforts from the multifarious Grid activities (PPDG, EU DataGrid, GridPP and more), off-site analysis work and other aspirations distributed throughout the D0 collaboration. The two main areas of work are Job Handling (including specification, brokering, scheduling etc.) and Monitoring and Information Services.

2.1           Overview of the SAM-Grid architecture

The SAM-Grid is a software suite that addresses the globally distributed computing needs of the Run II experiments at Fermilab. The Job and Information Management (JIM) components complement the Data Handling system of the experiments (SAM), providing the user with transparent remote job submission, data processing and status monitoring.

 

The logical entities of the SAM-Grid consist of

1.      Multiple Execution Sites

2.      A central Resource Selector[1]

3.      Multiple Job Submission Sites

4.      Multiple Clients (User Interface) to the Job Submission Sites.

 

Servers at the Job Submission Sites and at the Execution Sites register with the Resource Selector. Users describe and submit jobs to the Submission Sites via a User Interface, which can ultimately be installed on a laptop. The Submission Sites maintain a spool of jobs that are periodically matched with the available resources. Matches are currently ranked by the Resource Selector according to the number of files of interest to the job that are already present at the Execution Site. Submission Sites are then responsible for reliably dispatching the job to the Execution Site. Typically, Submission Sites will also spool job outputs.

 

Typical resources at the execution site consist of

1.      A Local Resource Management system

2.      A SAM Station

3.      An Information Manager

 

The Local Resource Management system generally has experiment specific interfaces[2] and is based on a Batch System; it is responsible for receiving and processing jobs from the Submission Site. The SAM Station is a collection of resources managed by a set of services to satisfy Data Handling requests from individual jobs or other entities, like the Information System or the Resource Selector. It generally manages a pool of disk caches and may be interfaced to a local Mass Storage System. SAM Stations rely on a set of supporting services, some of which are distributed and some central. The Information Manager provides service configuration support and monitoring of status information. Each Site advertises resource availability to the Resource Selector.

3            Installation of the SAM-Grid

A site can join the SAM-Grid in four ways: as a Client Site, a Submission Site, an Execution Site, or a Monitoring Site (Sections 3.3 through 3.6):

 

NOTE: Make sure to follow the instructions printed out at installation time.

DISCLAIMER: installing any of the JIM packages will drive you through the installation of Globus: the installation will be MUCH easier if the product area is NOT NFS shared. However, below you will find instructions on how to install Globus in this scenario as well.

 

Since the current focus of the SAM-Grid development is enabling distributed SAM analysis jobs, the discussion below assumes the site runs a SAM station. Please refer to http://d0db.fnal.gov/sam/doc/install/ for instructions tailored to the DZero environment, and to http://cdfdb.fnal.gov/sam/doc/cdf/install/install.html for CDF.

3.1           System requirements

3.1.1                Hardware

The requirements will vary depending on configuration and custom installation choices.

 

Memory:        128 MB of RAM (256 MB recommended)

Hard Disk:     1 GB (recommended)

Processor:     Intel x86 (Pentium II or above recommended)

3.1.2                Software

 

Linux:         kernel >= 2.4 (RedHat or SUSE recommended)

UPS/UPD:       >= 4.7

The packaging tool used for the SAM Grid is ups/upd.

The installation of Globus will not work if you use an earlier version.

If you need to install ups/upd, please go to http://www.fnal.gov/docs/products/ups/ .

If ups/upd is already installed on your system, you generally have to source a setup file: /usr/local/etc/setups.(c)sh for a typical installation and for DZero, or ~cdfsoft/cdf2.(c)shrc for CDF.
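A typical session therefore begins like this (the setups path is the typical one quoted above; ups list -aK+ is the standard ups way to list declared product instances):

```shell
# Bootstrap the ups/upd environment, then check which upd versions
# are declared (verify >= v4.7).
. /usr/local/etc/setups.sh
ups list -aK+ upd
```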

3.1.3                System Configuration

·        Create a local ups product area, where all the SAM-Grid products will be installed. We strongly recommend that this area be owned by user sam: see ftp://ftp.fnal.gov/products/bootstrap/current/index.html#unix_user to create such a product area.

·        Create a local user called sam. Optionally, create a user called samgrid to enable generic authorized grid users to run jobs (this is optional, since users can be mapped to individual accounts, but highly recommended).

·        Create a directory writable by user sam, named e.g. "jim". Initialize the environment variable SAMGRID_LOCAL_DIRECTORY to point to it. This is optional but will make installation easier. This is the area used by SAMGrid products at runtime for their activities, including sandboxing.
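The steps above can be sketched as shell commands (the account-creation lines require root and are shown commented out; the directory path is an assumption, any local sam-writable path works):

```shell
# As root (one-time): create the accounts discussed in this section.
# useradd sam
# useradd samgrid

# As user sam: create the runtime work area and point
# SAMGRID_LOCAL_DIRECTORY at it. The path is an example.
mkdir -p "$HOME/jim"
export SAMGRID_LOCAL_DIRECTORY="$HOME/jim"
```

Putting the export into sam's shell profile keeps the variable defined across sessions.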

3.1.4                Summary of the activities as root

 

In order to install the whole JIM software suite, root access is needed for the following actions:

 

3.1.4.1          Setup Group Accounts

 

SAMGrid’s servers typically run under, and use files belonging to, the “sam” UNIX account. Thus, an absolute minimum requirement is to have the “sam” account set up. While it is possible to run the SAMGrid servers under another account, doing so will greatly complicate our support.

 

In the past, the SAM team also recommended the “products” account for use by the UPS/UPD system. This account exists on nearly all the FNAL systems. For our purposes, we realize that, outside of FNAL, UPS/UPD is installed solely for computing with SAM and therefore a separate user for merely owning the products files is hardly necessary. Moreover, the distinction between SAM and products creates numerous problems with permissions as our servers (especially third-party software) often write files at run-time that belong to “products” unless specifically changed.  We therefore strongly recommend installing and maintaining products as user “sam”.

 

For an execution site, depending on your local policies, you need to authorize off-site (relative to your site, not FNAL) users to execute jobs (please note that, by definition, this is required for your site to be part of the Grid). You may choose to map external authorized users to the local “sam” account (which potentially might interfere with the SAMGrid server operation) or to another group account such as “samgrid”.

3.1.4.2          Open Ports for Incoming TCP connections

 

The following ports must be opened in the firewall for the head node (NB: SAMGrid does NOT require direct connectivity between worker nodes and the Internet):

 

grid gatekeeper:   (execution site only) 2119, open to all Submission Sites. See the Section on the Architecture for the definition and http://samgrid.fnal.gov:8080/ for the list of the currently known submission sites. “Open to the world” would enable us to add new submission sites without changing the configuration of all the execution sites.

job-managers:       (execution site only) Any contiguous range of N ports, also open to the Submission Sites, where N is the number of concurrently running Grid jobs (a Grid job is “running” if it has been submitted to your local batch system). We recommend a number on the order of 100. The same consideration as above applies for “open to the world”. In order to have the gatekeeper use this port range, it needs to be started (e.g. via xinetd) with the environment variable GLOBUS_TCP_PORT_RANGE=50001,50100 (example).

condor_schedd:     (submission site only) Any contiguous range of M ports, where M is the maximum number of Grid jobs concurrently submitted through your site. Open to all Client machines authorized to use your submission site. (If all the authorized client machines are behind the same firewall, you do not need to open any of these ports.) Add to the $CONDOR_CONFIG file of jim_broker_client the macros LOWPORT and HIGHPORT defining the two ends of the range.

grid MDS:             (monitoring site only) 2135, open to samgrid.fnal.gov, or better to all of FNAL to enable possible failover mechanisms.

tomcat:                  (all site types except client) 7080, open to samgrid.fnal.gov; enables configuration management via the XML Database and job output retrieval by the users.
(submission site only) 7081, GSI-secured door (optional), open to samgrid.fnal.gov and all the client machines; this provides for secure job cancellation by the users.

 

If the site runs a SAM station, these are the ports that need to be opened:

 

sam:                       4550-4555, open to FNAL. This is required for CORBA callbacks by SAM servers. At an absolute minimum, the list should include d0mino.fnal.gov (or any other D0 FNAL data router station) and d0db[-dev].fnal.gov for D0, and cdfdb.fnal.gov for CDF. Use the option
--OAport=portNum to define on which port a given SAM server listens.

sam_dcache_cp:   (CDF only) 25126 and 2811 Mainly to cdfdca.fnal.gov (for access to the CDF DCache system). D0 dcache systems to come soon. See also sam_gridftp client.

sam_gridftp server: 4567 (control) + any contiguous range of K ports (data), open to all sites that will be allowed to pull data out of your site. NB: these should include the headnodes of all the SAM stations if you want to be considered part of the Grid!

sam_gridftp client: Any contiguous range of K ports for data, where K is the number of simultaneous transfer streams initiated by your site, must be open to all sites from/to which your site will pull/push data (at a minimum, d0mino.fnal.gov for D0). This number must also match the number of parallel transfers set in the external SAM stager.

sam_bbftp server: (deprecated by grid_ftp). Open 14021 as described under the sam_gridftp server.

sam_bbftp client:  (deprecated by grid_ftp) All ports must be open to d0mino.fnal.gov (D0) and other sites where your site will push data.

 

 

More information on the requirements posed on firewalls by the Globus Toolkit can be found at http://www.globus.org/security/v2.0/firewalls.html
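Purely as an illustration (a sketch assuming an iptables-based firewall and the example port numbers used above; adapt the commands to your site's firewall tooling, and prefer restricting source addresses to the known submission sites rather than accepting from anywhere), opening the execution-site ports might look like:

```shell
# Run as root on the head node.
iptables -A INPUT -p tcp --dport 2119 -j ACCEPT          # grid gatekeeper
iptables -A INPUT -p tcp --dport 50001:50100 -j ACCEPT   # job-managers
```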

 

3.1.4.3          Enable Automatic Restart of SAMGrid servers at Boot Time

 

Exact means for this vary and depend on the local administrator’s preferences. A typical way is to modify the /etc/rc.local so that it includes a line similar to this:

 

su sam -c /home/sam/samgrid_start.sh

 

Also see the Section on server start-up.
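A minimal sketch of what such a start-up script might contain (the server list and the setups path depend on your site type and local installation; this is an assumption, not a prescribed script):

```shell
#!/bin/sh
# Hypothetical /home/sam/samgrid_start.sh: bootstrap the ups environment,
# then start the SAMGrid servers installed at this site, using the
# per-product start operations shown later in this manual.
. /usr/local/etc/setups.sh
ups start tomcat
ups run xmldb_server &
ups run jim_broker_client &
```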

 

3.1.4.4          Setup the /etc/grid-security and xinetd daemon (Execution Site Only)

 

See Sections on configuring GSI and installing Globus gatekeepers (a.k.a. resource manager bundle).

3.1.5                Packages and Samgrid Production Release Cuts

The requirements of other packages are driven by the type of configuration you choose and are listed in their respective sections. For each type of installation we have laid out the list of packages below.

You can find the latest Samgrid production cut at http://www-d0.fnal.gov/computing/grid/releases/

3.2           Middleware Installation

This refers to the general installation procedures required by all the Site installations, unless specified otherwise.

3.2.1                Installing and Configuring Condor and Globus

The SAM-Grid uses the Condor and Globus middleware distributed by the Virtual Data Toolkit. The VDT product in ups is a wrapper around pacman: the software comes from the official VDT web site.

It is important that there is no variable in the environment that points to other instances of Globus while installing this new instance. You can check e.g. if GLOBUS_LOCATION or GPT_LOCATION are already defined or that PATH includes paths to other installations of Globus. In that case, check e.g. ~/.shrc and /etc/profile (or similar environment bootstrapping files) to eliminate such definitions during the installation phase.

 

Product

VDT

Install as

Sam

Install operation

upd install VDT -G-c

Tailor as

Sam or Root (see below)

Tailor Operation

Before tailoring make sure that your system has

1.      the “patch” command

2.      “gcc” (the appropriate version for your Linux distribution)

3.      a ‘recent’ version of tar: v1.13.12 or newer.

More info at http://www.cs.wisc.edu/VDT/

 

as user sam:

$ ups tailor VDT

 

as user root:

$ ups InstallAsRoot VDT

 

Notes:

·          Because tailoring is CPU and I/O intensive, beware that

1.      On some systems this command can take 30 min.

2.      Installations on NFS mounted disk can give I/O related problems

·          the script executed as root changes the xinetd config files and restarts the xinetd daemon.

·          At the end of the installation, the location of the installation log will be printed out. Look at it for potential problems.

 

Notes for experts:

·          to change the default location of the gatekeeper gass cache, add to the xinetd configuration file the line
env = GLOBUS_GASS_CACHE_DEFAULT=/path/to/new/location

·          to let the gatekeeper know what ports are open in your firewall to run the job-managers, add something like this line to the xinetd configuration file:

env = GLOBUS_TCP_PORT_RANGE=50001,50100
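For reference, those env settings live inside the gatekeeper's xinetd stanza. A rough sketch of such a stanza follows (paths and values are illustrative assumptions; your VDT installation writes the real one):

```
# /etc/xinetd.d/globus-gatekeeper (illustrative fragment)
service globus-gatekeeper
{
    socket_type = stream
    protocol    = tcp
    wait        = no
    user        = root
    server      = /path/to/globus/sbin/globus-gatekeeper
    server_args = -conf /path/to/globus/etc/globus-gatekeeper.conf
    env         = GLOBUS_TCP_PORT_RANGE=50001,50100
    env        += GLOBUS_GASS_CACHE_DEFAULT=/path/to/new/location
    disable     = no
}
```

Remember to restart the xinetd daemon after editing its configuration.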

 

 

3.2.2                Installing and Configuring the Grid Security Infrastructure

This product configures the Grid Security Infrastructure (GSI) on your system.

 

Product

sam_gsi_config

Install as

Sam

Install operation

upd install sam_gsi_config –q VDT -G-c

Tailor as

Sam or Root (see below)

Tailor Operation

$ups tailor sam_gsi_config –q VDT

The tailoring procedure configures GSI for various SAM-Grid products. You will be asked which products you want to configure GSI for. If you do not know, configure it for all of them.

The script will print out which user(s) need to execute the command below. Typically, you need to execute it as user sam and as root (for an execution site installation):

$ups install_ca sam_gsi_config –q VDT

 

 

If you are installing either a Client site or a Monitoring site, please skip to the site-specific installation. Otherwise (Submission site and Execution site installers), read further.

Skip to Client Site Installation

Skip to Monitoring Site Installation

3.2.3                Updating the Grid Security Infrastructure

This paragraph describes what to do when a CA certificate has expired and needs to be replaced. It assumes a working sam_gsi_config installation. Also, you must know the fingerprint string of the expired CA.

 

Product

sam_gsi_config

Update as

products and/or SAM and/or root (see later)

Update Operation

If your sam_gsi_config installation is older than v2_0_8, first do

$ups update_config sam_gsi_config –q VDT

Update a CA certificate as:

$setup sam_gsi_config –q VDT

$sam_gsi_install_ca --fingerprint=<fingerprint_hash>

where fingerprint_hash is a string of the form e1fce4e9.

 

Instructions on what other users should execute this command will be printed on the screen. To force the installation as a user different from the one recommended by sam_gsi_config, add the option --force-user
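To illustrate where the fingerprint_hash string comes from: it is the OpenSSL subject-name hash of the CA certificate (note that, depending on your OpenSSL version, grid tools may expect the old-style hash, available via -subject_hash_old). The following throwaway demo generates a self-signed certificate and prints its hash; the file paths are hypothetical:

```shell
# Generate a short-lived, self-signed demo CA certificate (no passphrase).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Demo CA" \
    -keyout /tmp/demo_ca.key -out /tmp/demo_ca.pem -days 1 2>/dev/null
# Print its 8-hex-digit subject hash, the form used by --fingerprint.
openssl x509 -hash -noout -in /tmp/demo_ca.pem
```

For a real CA, run the second command against the certificate file installed under your grid-security certificates directory instead.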

 

3.2.4                Get a Service Certificate

Request a SAM service certificate from the DOEGrids CA. If you want to use a CA other than DOEGrids, this may be fine: please send email to cdfsam-admin@fnal.gov or d0sam-admin@fnal.gov.

If you are installing an execution site, you will also need to get a host certificate: you may want to get it now. Follow instructions at Get a host certificate.

 

As user

Sam

Operations

$ setup sam_gsi_config -q VDT

 

$ sam_cert_request

Follow instructions on the screen.

Notes:

The command above will drive you through the request of a SAM service certificate (typically 1 day response). When you receive by email your signed certificate, save it as is in the location printed on the screen and make it owned by user “sam”.

 

More detailed instructions for the installation of a SAM service certificate for sam_gridftp at

http://d0db.fnal.gov/sam/doc/install/fileTransfer.shtml#sam_gridftp

3.2.5                Installing XMLDB

This is an xml database server. It is currently implemented using the Xindice database and is used within the SAM-Grid as the interface that the Grid and the Fabric use to exchange information. Its main function is to store product and resource configurations.

Install the following packages (Tomcat & xmldb_server) on a single machine at your site. It can be a submission site, an execution site, or an independent machine, but we recommend installing it on the submission site if you need output retrieval via the web.

The installation of Tomcat is optional if you have another servlet runner. Tomcat is used as a servlet engine within SAM-Grid to run the xmldb_server servlet.

Product

Tomcat

Install as

Sam

Install operation

upd install tomcat -G-c

Tailor as

Sam

Tailor Operation

ups tailor tomcat

Notes:

Defaults are fine.

The product area where tomcat is installed must be owned by user “sam”. If you have installed this server as “products” for special reasons, change the ownership from “products” to “sam” (e.g. if you have root) or execute “ups chown tomcat”.

Start as

Sam

Start operation

ups start tomcat

 

 

Product

xmldb_server

Install as

Sam

Install operation

upd install xmldb_server -G-c

Tailor as

Sam

Tailor Operation

ups tailor xmldb_server

Configuration example:
<?xml version="1.0"?>
<xmldb_server_configuration>
  <interview_schema version="1_0"/>
  <xmldb_server
      db_name="db"
      webapps_directory="/local/ups/db/tomcat/webapps"
      db_location="/data/jim/xmldb_server/db"
      run_command="ups run tomcat"
      stop_command="ups stop tomcat">
  </xmldb_server>
</xmldb_server_configuration>

 

Configuration Parameters:

webapps_directory: Enter the directory used by your servlet engine to store the servlets.

db_location: Enter the directory used by the database to store the documents.

db_name: Enter the name of the xml database; this name is used when querying the database. Use the default 'db'

run_command: Enter the command that starts up your servlet engine.

stop_command: Enter the command that stops your servlet engine.

Notes:
Refer to Section System Configuration to get sensible defaults while tailoring. You have to decide where to store the xml documents of the db. This area must be writable by sam.

We have observed corruption in the xmldb whenever the disk storing the DB files gets full. The only way to recover from this is to clean up the database files and start from scratch, so keep this in mind when deciding on the db_location. The disk used by the xmldb grows as more information is added with every local job running at the site, and the growth is non-linear, so there is no good metric for predicting the disk space required. It is the administrator's responsibility to ensure the machine does not run out of disk space.
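Given the full-disk failure mode described above, a simple guard is to monitor the partition holding db_location from cron; this is a minimal sketch (the path default and the 90% threshold are assumptions to adapt):

```shell
# Warn before the filesystem holding the xmldb files fills up; a full
# disk corrupts the database and forces a rebuild from scratch.
DB_LOCATION="${DB_LOCATION:-.}"   # e.g. /data/jim/xmldb_server/db
THRESHOLD=90
# Field 5 of POSIX `df -P` output line 2 is the capacity, e.g. "42%".
usage=$(df -P "$DB_LOCATION" | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: filesystem holding $DB_LOCATION is ${usage}% full"
fi
```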

Start  as

Sam

Start Operation

ups run xmldb_server &

Notes:

YOU NEED TO RUN THE COMMAND NOW if you plan to use this database for the configuration of other products (recommended). Refer to the Section on starting up the servers for instructions on running all servers.

 

Install the following software on both submission and execution sites.

 

Product

xmldb_client

Install as

Sam

Install operation

upd install xmldb_client -G-c

Tailor as

Sam

Tailor Operation

ups tailor xmldb_client

 

Configuration example:
<?xml version="1.0" encoding="UTF-8"?>

<xmldb_client>

  <interview_schema_version version="1_0"/>

  <xmldb_server url="http://samgfarm4.fnal.gov:7080/Xindice"/>

</xmldb_client>

 

Configuration Parameters:

url: Enter the xml db server for your site. If this is the machine that runs the xml db server, accept the default; otherwise enter the correct address.

Example tailoring dialog (the URL is typically of the form http://my.db.host:7080/Xindice):

What is the url of the xmldb_server ? [http://samham.fnal.gov:7080/Xindice]:

    The attribute url is set to the 'http://samham.fnal.gov:7080/Xindice'
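After tailoring, a quick way to check that the configured URL answers (assuming curl is available; the hostname below is the hypothetical example form from above):

```shell
curl -sf http://my.db.host:7080/Xindice >/dev/null \
    && echo "xmldb server reachable" \
    || echo "xmldb server NOT reachable"
```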

3.2.6                Store the SAM Grid Global Constants to the XML database

Product

jim_config

Configure as

Sam

Configure operation

$ ups store_constants jim_config

Notes

This will store the global constants like SAM IOR, broker location, DB Server name etc in the database.

Skip to Submission Site Installation

Skip to Execution Site Installation

3.3           Client Site Installation

The site from which you submit your job to the Grid. This is a very lightweight component that can be set up by installing just jim_client.

 

Product

jim_client

Install as

Products

Install operation

upd install jim_client -G-c

Tailor as

Products

Tailor Operation

ups tailor jim_client [-q <qualifier if allocated by “new” command>]

 

Configuration example:
<?xml version="1.0" encoding="UTF-8"?>

<jim_client_configuration>

  <interview_schema version="1_3"/>

  <condor_config_parameters>

    <uid_domain domain="fnal.gov"/>

    <schedd_host hostname="samgrid.fnal.gov"/>

    <condor_host hostname="samgrid.fnal.gov"/>

    <network_interface>

      <public_interface ip="131.225.167.1" />

    </network_interface>

    <structured_jobs structured_jobs="no"/>

  </condor_config_parameters>

  <MyProxy_Server hostname="fermigrid4.fnal.gov"/>

</jim_client_configuration>

 

Configuration Parameters:

uid_domain: Enter your domain

schedd_host: Enter the hostname of the submission site

condor_host: Enter the hostname of the jim_broker. Use default.

public_interface, network_interface: Enter the IP address of your system that you want to use. If you have multiple network interfaces, use the IP address of the interface that is accessible from outside your local network. To get information on the various interfaces on your system, run /sbin/ifconfig in another window.

structured_jobs: Enter whether you want to run structured jobs. Answer 'no' here.

MyProxy_Server: Enter the address of the MyProxy server. Use the default.

Notes:

You can ignore warnings about xmldb_client: by default, the JIM configuration manager will try to store this configuration into an xml database; this is not required for jim_client and the automatic FS storage is sufficient.

Creating a new environment

ups new jim_client

The ups command "new" creates and declares a qualifier name which can be used to tailor and store multiple, independently accessible jim_client configurations. "ups new jim_client" will prompt for user input and perform the steps to declare a new instance of the product in the ups database. The newly declared product needs to be tailored the same way as its non-qualifier-based version outlined in the previous step; to do that, the specified qualifier name must be used explicitly, i.e. ups tailor jim_client -q <new qualifier>.

The new environment will be available after setting up jim_client with “setup jim_client -q <new qualifier>”.

Congratulations. You may start submitting your job if your submission site is configured.

End of Client site Installation!

3.4           Submission Site Installation

3.4.1                General configuration

Make sure you have followed the middleware installation instructions in paragraph 3.2; in particular you need to install Condor and Globus, configure GSI, request a service certificate, install the XML database and store the global SAMGrid constants into it.

3.4.2                Installation of the JIM Broker client

Product

jim_broker_client

Install as

Sam

Install Operation

upd install jim_broker_client -G-c

Tailor as

Sam

Tailor Operation

$ups tailor jim_broker_client

 

Configuration example:
<?xml version="1.0"?>

<jim_broker_client_configuration>

  <interview_schema version="1_6" />

  <condor_config_parameters>

    <uid_domain domain="fnal.gov" />

    <local_dir dir="/data/jim" />

    <spool_dir dir="/data1/jim" />

    <condor_host hostname="samgrid.fnal.gov" />

    <condor_admin_email email="parag@fnal.gov" />

    <network_interface ip="131.225.110.153" />

    <broker_identity subject="/DC=org/DC=doegrids/OU=Services/CN=sam/samgrid.fnal.gov" />

    <condor_lowport_highport lowport_highport="49152,65535" />

    <site_name site_name="samgrid.fnal.gov" />

  </condor_config_parameters>

</jim_broker_client_configuration>

 

Configuration Parameters:

uid_domain: Enter your domain

local_dir: Enter the full path where you want to store your log files for the JIM suite. It must be a local path. A directory called jim_broker_client will be created automatically inside this local_dir when you first start the scheduler. Logs and the gridmapfile pertaining to the JIM broker client installation will be stored here. User ‘sam’ must have write access to this directory.

spool_dir: Enter the full path where you want to store your spool files for the JIM suite. It must be a local path. A directory called jim_broker_client will be created automatically inside this spool_dir when you first start the scheduler. The spool area is where the input and output sandboxes for the JIM broker client are stored. User ‘sam’ must have write access to this directory.

condor_host: Enter the hostname of the broker

condor_admin_email: Enter the administrator email-id for this installation

public_interface, network_interface: Enter the IP address of your system that you want to use. If you have multiple network interfaces, use the IP address of the interface that is accessible from outside your local network. To get information on the various interfaces on your system, run /sbin/ifconfig in another window.

broker_identity: Enter the certificate subject of the Broker. Use the default.

condor_lowport_highport: Enter the range of port numbers on which you want the condor processes to run (e.g. 50101,50120). Please note that this is important if the schedd node is behind a firewall.

site_name: Enter the Site Name. This is the name which will appear in the ClassAd of the schedd and will be displayed on the web.

 

Notes:

Define the variable SAMGRID_LOCAL_DIRECTORY, as explained in Section System Configuration, to sensible defaults.

 

You will be asked several questions. Choose a directory for the job spooling area: user “sam” will write the input sandbox and other files to this area on behalf of the user's job. A good location is the samgrid local area (where the local ups directory generally is) in a directory called "jim". AFTER tailoring you'll need to chown -R this area to user sam. Don't change the defaults of the other questions unless you absolutely know what you are doing.
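The post-tailoring ownership fix mentioned above can be done as root; the paths below are the example values from the configuration sample earlier in this section, not a prescription:

```shell
# Run as root: hand the JIM log and spool areas over to user sam.
chown -R sam /data/jim /data1/jim
```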

 

Other Useful Tasks:

·        Generate gridmapfile from voms

Refer to the Section “Automating the Maintenance Tasks”

 

The following tasks are no longer supported.

·        To add a new user to use your Submission site, execute

 

$ups AddUser jim_broker_client

 

NOTE: you can add users ONLY AFTER you successfully started jim_broker_client once.

 

·        Optionally, if you are an advanced user and want to add multiple users at the same time, first create an input_file with a list of Grid subjects and execute

 

             $ <jim_broker_client_prod_dir>/ups/gridmap_gen.py < input_file >> condor_schedd_gridmap_file

Run as

Sam

Run Operation

ups run jim_broker_client &

Notes:

Refer to the Section on starting up the servers for instructions on running all servers.

DO NOT DO THIS STEP UNTIL YOU HAVE INSTALLED THE SAM SERVICE CERTIFICATE.

3.4.3                Installing Output retrieval via web

This is an optional package for users who prefer to retrieve their output from a web page after the job is completed.

·        Make sure your servlet runner is installed and configured properly in your submission site. You may optionally install our distribution of tomcat.

·        Install jim_www_sandbox servlet

Product

jim_www_sandbox

Install as

Sam

Install operation

upd install jim_www_sandbox -G-c

Tailor as

Sam

Tailor Operation

ups tailor jim_www_sandbox

 

Configuration example:
<?xml version="1.0" encoding="UTF-8"?>

<jim_www_sandbox_configuration>

  <interview_schema version="1_1"/>

  <nonsecure_services url="http://samgrid.fnal.gov:7080" directory="/data/products/ups/db/tomcat/webapps"/>

  <secure_services url="https://samgrid.fnal.gov:7081" directory="/data/products/ups/db/tomcat/secureapps"/>

  <jim_out_sandbox servlet_secure="no"/>

</jim_www_sandbox_configuration>

 

Configuration Parameters:

nonsecure_services, directory: Enter the directory used by your servlet engine to run non-secure servlets.
nonsecure_services, url: Enter the URL from which non-secure servlets are hosted.

secure_services, directory: Enter the directory used by your servlet engine to run secure servlets.
secure_services, url: Enter the URL from which secure servlets are hosted.

Notes:

You will be prompted to enter the location where you have installed the servlet on your machine. After tailoring, you may need to restart the servlet runner.

·        If you are using another distribution of tomcat or another servlet runner, you may need to do the following additional step.

o   Verify that the Broker client is configured properly and that its environment is accessible to the servlet runner during startup, i.e. your servlet runner should do “setup jim_broker_client” in the same terminal before starting up so that it gets access to the Broker client’s environment.
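One way to satisfy this is a small wrapper script used to start the servlet runner; the sketch below assumes your setups file lives at /usr/local/etc/setups.sh and that the tomcat product defines $TOMCAT_DIR:

```shell
#!/bin/sh
# Hypothetical wrapper: give tomcat the Broker client's environment
# before starting it. Both paths are assumptions for your site.
. /usr/local/etc/setups.sh
setup jim_broker_client
exec "$TOMCAT_DIR"/bin/startup.sh
```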

End of Submission site Installation.

Skip to starting up the servers.

3.5           Execution Site Installation

 

IMPORTANT:  Before you proceed with the usual “upd install / ups tailor” routine, be sure to read, understand, and execute instructions from the document describing the grid to fabric job submission interface. This job submission is the core of the execution site installation (even though this is merely 1 or 2 packages out of 20 or so total for the execution site) and has historically caused the most questions and problems. Please do not install the rest of the execution site if the local job submission is not working!

 

3.5.1                General configuration

Make sure you have followed the middleware installation instructions at paragraph 3.2; in particular you need to install Condor and Globus, configure GSI, request a service certificate, install the XML database and store the global SAMGrid constants into it.

3.5.2                Install sam

See http://d0db.fnal.gov/sam/ for DZero and CDF instructions.

Install the latest version of SAM and declare it current. Refer to the Samgrid release cuts at http://www-d0.fnal.gov/computing/grid/releases/. Also make sure that "setup SAM -q d0_prd" (or -q cdf_prd) sets up the latest version (check, for example, $SAM_DIR). If this is not the case, declare the previous versions "old".

We recommend that the installation of the JIM products be done in a separate ups database owned by user sam: this is the only set of products needed by the JIM software. On the other hand, the SAM software is generally installed in a product area owned by user products or cdfsoft. The JIM execution site software will need access to a few SAM products (see below): we found it convenient to simply install and configure them again in the JIM ups database:

·          sam client: the code and the configuration. By default, the JIM software will execute “setup SAM -q d0_prd” (or cdf_prd) to get the SAM client environment.

·          sam_cp_config: needs to be configured for intra-cluster transfers. Typically jim_gridftp or fcp is used. You can add the line ‘.’ : [ ‘jim_gridftp’, ], to your domain capability map.

3.5.3                Setting up durable location (Optional)

You may optionally decide to use a durable location set up at a different/central site, or you may set up a durable location on site. The durable location will be used by Samgrid jobs to store production files before they are merged and finally stored to tape. To set up a durable location, refer to Samgrid’s latest release cut at http://www-d0.fnal.gov/computing/grid/releases/. Install the packages listed under “Middleware packages”, “Sam client packages”, “Sam Station packages” and jim_gridftp. If the durable location is on a machine that acts as a Samgrid head node or station node, most of these packages should already exist. If not, please refer to the individual package installation and configuration. Once the installation is complete, register the location with SAM by sending an email to the SAM shifters with the name of the machine, the path to the storage and the disk size of the storage. Configure local_storage in the site configuration to use the durable location; refer to Site configuration for more details. If you need to configure multiple durable locations, please refer to the documentation on configuring a complex site with application-specific queues and storages at http://www-d0.fnal.gov/computing/grid/doc/Application-ResourceTuning-01Aug05-cut.pdf

3.5.4                Get a host certificate

 

As user

Root

Operations

You need to request a host certificate from a Certificate Authority (CA) for your gateway node (typically a 1-day response). SAM-Grid works mostly with the DOEGrids CA, but other CAs may be trusted as well. Contact d0sam-admin@fnal.gov or cdfsam-admin@fnal.gov for more information.

 

The GSI security binaries can be made available to your shell via

 

$setup VDT

 

The command to request the certificate from the DOEGrids CA is

 

$ GRID_SECURITY_DIR=/etc/grid-security grid-cert-request -host `hostname -f` -ca 1c3f2ca8

 

Follow the instructions at http://www.grid.iu.edu/osg-ra/HostRequest.php; in particular, you need to fill in a certificate request form. The relevant form is at https://pki1.doegrids.org/ca/ , under “Grid or SSL Server”.

 

When you are ready to fill in the form, use “Affiliation” OSG and “Experiment” DZero/CDF; you can mention in the comment that the certificate is for SAM-Grid.

 

3.5.5                Get the list of users authorized to use the resources (gridmap-file)

You need to configure your system with the list of users allowed to run jobs at your resources. This list is called the gridmap-file, as it maps the grid subjects of the users to the local unix accounts that run the jobs. The SAM-Grid has developed a tool that uses sam_gridftp to get an “official” list of users belonging to CDF or DZero. Before running the following commands, make sure your sam_gridftp is installed and working. In particular, RUN THE FOLLOWING COMMANDS ONLY AFTER YOU'VE RECEIVED THE SAM SERVICE CERTIFICATE.

 

Product

sam_gsi_config_util

Install as

Sam

Install operation

upd install sam_gsi_config_util -q VDT -G-c

Run as

Root

Run Operation

$setup sam_gsi_config_util

$sam_gsi_get_gridmap --gatekeeper --local-user=<user-running-jobs>

Note: this command will append the subjects of the DZero/CDF VO to your local grid-mapfile. If you have an old grid-mapfile, make sure that the mapping of the subjects to the user that runs the job is right, before using the tool: you may end up with the same subject mapped to two different users.
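For reference, each grid-mapfile entry is a quoted certificate subject followed by the local account it maps to, one mapping per line; the subject below is hypothetical:

```
"/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" samgrid
```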

Optional

- Edit root's crontab and add something like

 

0 * * * * . /usr/local/etc/setups.sh && setup sam_gsi_config_util && sam_gsi_get_gridmap --gatekeeper --no-default-gridmap --local-user=samgrid > /dev/null 2>&1

 

This will keep your grid-mapfile up to date.

 

3.5.6                Install SAM-Grid Globus job-managers and sandboxing mechanisms

Product

jim_job_managers

Install as

Sam

Install Operation

upd install jim_job_managers -G-c

Tailor as

Sam

Tailor Operation

ups tailor jim_job_managers

 

Configuration example:
<?xml version="1.0"?>

<jim_job_managers_configuration>

  <interview_schema version="2_1" />

  <ups setup_script="/local/ups/etc/setups.sh" />

  <experiment experiment_name="d0">

    <sam_prd_qualifier sam_prd_qualifier="d0_prd" />

    <sam_dev_qualifier sam_dev_qualifier="d0_dev" />

  </experiment>

  <dzero_monte_carlo>

    <events number_per_output_file="250" />

    <accelerator bootstrap_time="180" interval="90" />

  </dzero_monte_carlo>

  <cdf_monte_carlo>

    <accelerator bootstrap_time="180" interval="90" />

  </cdf_monte_carlo>

  <dzero_merge>

    <accelerator bootstrap_time="180" interval="180" />

  </dzero_merge>

  <dzero_reconstruction>

    <accelerator bootstrap_time="180" interval="300" />

  </dzero_reconstruction>

  <dzero_reco_merge>

    <accelerator bootstrap_time="180" interval="180" />

  </dzero_reco_merge>

  <dzero_tmbfix>

    <accelerator bootstrap_time="180" interval="300" />

  </dzero_tmbfix>

  <dzero_skimming>

    <accelerator bootstrap_time="180" interval="300" />

  </dzero_skimming>

  <local_tmp_area local_tmp_area="/data/jim/jim_tmp/" />

  <polling_interval grid_update_interval="300"

                    xml_update_interval="300" />

</jim_job_managers_configuration>

 

Configuration Parameters:


dzero_monte_carlo, dzero_merge, dzero_reconstruction, dzero_reco_merge, dzero_skimming, dzero_tmbfix:
Dzero application/job types which Samgrid supports. Add a similar tag when a new job type is supported.
accelerator, bootstrap_time: Time interval in minutes given for the job to bootstrap before the monitoring script considers the job stuck and kills it. A job is considered inactive if no disk write activity is taking place. The bootstrap time is only considered once, at the beginning of the job.

accelerator, interval: Time interval in minutes after which the job is killed if there is no disk activity.

events, number_per_output_file: Default maximum number of events written to an MC output file. The number of events per output file can be overridden by specifying runjob_numevts and events_per_file in the JDL.

local_tmp_area: Temporary area used by jim_job_managers to write files to. User ‘sam’ should have write access to the directory.

polling_interval, grid_update_interval

 

Note: To configure jim_job_managers with application specific queues, please refer to the documentation available off the Samgrid home page.

 

Product

jim_sandbox

Install as

Sam

Install Operation

upd install jim_sandbox -G-c

Tailor as

Sam

Tailor Operation

ups tailor jim_sandbox

 

Configuration example:
<?xml version="1.0"?>

<jim_sandbox_configuration home="/data/jim/jim_sandbox">

  <interview_schema version="3_0" />

  <keep_sandbox compressed="no" />

</jim_sandbox_configuration>

 

Configuration Parameters:

jim_sandbox_configuration, home: Enter the directory that handles the input sandboxes at the head node. The disk space required is large (typically hundreds of GB). This directory should be writable by user ‘sam’ and by the user running jobs, typically ‘samgrid’.

keep_sandbox, compressed: Whether to keep the sandboxes compressed at the head node.

 

Product

jim_gridftp

Install as

Sam

Install Operation

upd install jim_gridftp -G-c

Tailor as

Sam

Tailor Operation

ups tailor jim_gridftp

 

Configuration example:
<?xml version="1.0"?>

<jim_gridftp_configuration>

  <interview_schema version="2_0" />

  <host name="samgfarm4.fnal.gov">

    <data_server>

      <port number="4568" />

      <certificate subject="/DC=org/DC=doegrids/OU=Services/CN=sam/samgfarm4.fnal.gov"/>

    </data_server>

    <head_server>

      <port number="4569" />

      <certificate subject="/DC=org/DC=doegrids/OU=Services/CN=sam/samgfarm4.fnal.gov"/>

    </head_server>

  </host>

</jim_gridftp_configuration>

 

Configuration Parameters:

port: port on which data/head servers are running

certificate, subject: DN of the service.

Notes:

This product is used to start a gridftp server at the gateway node, for gateway/worker-node transfers, and a server on the node where the SAM station runs, for data transfer.

 

Product

sam_fcp

Install as

Sam

Install Operation

upd install sam_fcp -G-c

Tailor as

Sam

Tailor Operation

ups tailor sam_fcp

 

Configuration example:
<?xml version="1.0"?>

<sam_fcp_configuration>

  <interview_schema version="1_0" />

 

  <fcp_queue name="default">

    <fcp_port port="7788" />

    <max_xfers transfers="15" />

    <transfer_mechanism name="jim_gridftp" />

    <time_out value="3600" />

  </fcp_queue>

  <fcp_queue name="default1">

    <fcp_port port="7789" />

    <max_xfers transfers="15" />

    <transfer_mechanism name="jim_gridftp" />

    <time_out value="3600" />

  </fcp_queue>

</sam_fcp_configuration>

 

Configuration Parameters:

fcp_queue, name: name of the transfer queue; multiple queues may be defined.

fcp_port, port: port on which the fcp queue server listens.

max_xfers, transfers: maximum number of concurrent transfers allowed through the queue.

transfer_mechanism, name: transfer mechanism used for the copies (e.g. jim_gridftp).

time_out, value: transfer timeout (3600 in the example above).

Notes:

This product is used to control the number of concurrent transfers from the head node to the worker node.

3.5.7                Creating the Resource Description

Product

jim_config

Configure as

Sam

Configure operation

Describe the resources at your site to the JIM software suite by answering the questions prompted by issuing

$ ups configure_complex_site jim_config

or

$ ups configure_site jim_config

(for sites with 1 cluster, 1 gatekeeper, 1 jobmanager, 1 station,
1 durable storage area)

Notes:

The resources at a site are organized with the following hierarchy:

·        a site can have multiple clusters

·        a cluster can have multiple "gatekeepers" (grid gateways)

·        a gatekeeper can have multiple "jobmanagers" (grid interfaces to local resource manager interfaces)

·        a "jobmanager" can submit jobs to multiple SAM station (typically in different universes i.e. production or development)

·        a cluster can also have multiple SAM stations not accessible via the grid (sam stations that are not "under" any gatekeepers and used for local submission)

·        a cluster can have a durable storage, to keep intermediate processing files. This is configured by setting up local_storage tag in the configuration.

·        To configure a complex site with application specific queues, please refer appropriate documentation on the Samgrid homepage.

 

Example in xml of a site description:

 

<?xml version="1.0"?>

<site_configuration>

  <site name="FNAL" />

  <schema version="v1_0" />

  <cluster name="SamGrid-testbed" architecture="Linux+2.4">

    <gatekeeper location="samadams.fnal.gov:2119">

      <jobmanager name="jobmanager-sam">

        <station name="samadams" universe="dev" experiment="d0" />

        <station name="samadams" universe="prd" experiment="d0" />

      </jobmanager>

    </gatekeeper>

  </cluster>

  <local_storage path="/data/sam/disk/durable_location"

                 node="samgfarm4.fnal.gov" />

</site_configuration>

3.5.8                Installing the resource advertisement software

 

Product

jim_advertise

Install as

Sam

Install operation

upd install jim_advertise -G-c

Tailor as

Sam

Tailor Operation

ups tailor jim_advertise

 

Configuration example:
<?xml version="1.0"?>

<jim_advertise_configuration>

  <interview_schema version="1_3" />

  <verbose value="true" />

  <log_file path="/data/jim/jim_advertise/log" />

  <advertise_interval interval="180" />

  <collector fqdn="samgrid.fnal.gov" />

  <extra_condor_advertise_args arguments="-debug" />

  <classad_generator

     xquery="${JIM_ADVERTISE_DIR}/bin/xml2classad_cgs.xq" />

  <post_filter exe="${JIM_ADVERTISE_DIR}/bin/postFilter_cgs.sh" />

  <condor_config_parameters>

    <network_interface ip="131.225.167.1" />

  </condor_config_parameters>

</jim_advertise_configuration>

 

Configuration Parameters:

verbose, value: Whether jim_advertise should log debug messages while executing (the log file becomes very big).

log_file, path: Enter the full path of your log file (missing directories will be created automatically when you first start jim_advertise).

advertise_interval: Enter the interval in seconds for sending the classads to the collector.

collector, fqdn: Enter the Fully Qualified Domain Name of the collector. Use default.

extra_condor_advertise_args: Enter any extra condor arguments you want to send to the collector (e.g. -tcp -debug).

classad_generator: This xquery will decide how to publish resources in the form of classad from the site configuration XML file. Use defaults.

post_filter: Enter the full path to post filtering script if you have one. This script will take the output of the Classad generation procedure and should produce an output in the same form of the input. The use of environment variables is allowed. Use defaults.

condor_config_parameters, network_interface: Enter the IP address of the interface that you want to use. If you have multiple network interfaces, use the IP address of the interface that is accessible from outside your local network. To get information about the interfaces on your system, run /sbin/ifconfig in another window.

Run as

Sam

Run Operation

ups run jim_advertise &

Notes:

Refer to Section “Starting the Servers” for instructions on running all servers.

End of Execution site Installation.

Skip to starting up the servers.

3.6           Monitoring Site Installation

The SAM-Grid monitoring service is available on the web at http://samgrid.fnal.gov:8080

 

In order for a site to be monitored, there are 4 steps to follow:

1.      Install Globus MDS on at least one machine of the site.

2.      Create the site configuration.

3.      Configure/update MDS with the SAM-Grid schema/information hierarchy.

4.      Inform the SAM-Grid team of the availability of the new monitoring site with the following details: the host and port where MDS is running, and the Jim-Site name chosen by the site administrator while tailoring jim_info_providers.

3.6.1                Create site Configuration

You may skip this sub-section if you have already configured the site information for the advertisement framework. If you have not done so yet, follow Section “Creating the Resource Description” now.

3.6.2                Configure/Update MDS

Product

jim_info_providers

Install as

Sam

Install operation

$ upd install jim_info_providers -q GCC-2.95.2 -G-c

Tailor as

Sam

Tailor operation

$ ups tailor jim_info_providers -q GCC-2.95.2

Run as

Sam

Run operation

$ ups start jim_info_providers -q GCC-2.95.2

Notes:

Ignoring warning messages at startup time is generally ok. Refer to Section “Starting the Servers” for instructions on running all servers.

 

End of Monitoring Site Installation.

You may start the servers now.

4            Starting the Servers

After you've installed all the components of SAM-Grid, i.e. JIM and/or SAM, install the package that runs the servers.

Product

server_run

Install as

Sam

Install operation

upd install -G-c server_run

Tailor as

Sam

Tailor operation

ups tailor server_run

Configure as

Root

 

Edit the appropriate system bootstrap files in /etc/rc.XXX (/etc/rc.d/rc.local is a good choice) so that the following is executed at system boot time:

 

$ su sam -c /full/path/samgrid_startup.sh

 

where a file samgrid_startup.sh contains something like:

 

#!/bin/sh

source $SETUPS_DIR/setups.sh

ups run server_run

 

If your new installation of server_run also includes the SAM server suite, you might also want to disable automatic start-up of SAM servers by the lower-level sam_bootstrap package.

 

Strongly Recommend

If you can, reboot your machine at this time to check that server_run is started properly upon system boot. We don't ask you to do this out of affection for the popular PC operating system! ;-).

If impossible, start the servers now (see below).

Run as

Sam

 

If you were running the XML DB server during this installation (recommended), remember to stop it at this time:

 

$ ups stop xmldb_server

 

You can now run SAM-Grid (assuming you have all the certificates in place):

 

On a new terminal window (for a clean environment) type

$ ups run server_run

Congratulations. Your installation is complete.

5            Modifying the Product Configuration

Product

samgrid_util, plus the product whose configuration you want to modify.

Execute as

Sam.

Execute operation

$ setup samgrid_util

$ setup <product name>

$ jim_configure.sh <product name>


This will open the product's configuration in the vi editor. Make the required changes to the configuration, then save and exit the vi session.

Example:

1.      Modifying the sam_fcp configuration:
$ setup samgrid_util
$ setup sam_fcp
$ jim_configure.sh sam_fcp

2.      Modifying the site configuration:
$ setup samgrid_util
$ setup jim_config
$ jim_configure.sh jim_config

6            Automating the Maintenance Tasks

6.1           Regular Cleanup and Maintenance Tools

 

Newer versions of the samgrid_util package (v3_1_8+) have useful scripts in the cron directory of the product. These scripts can be installed in crontab to automate some of the regular maintenance tasks via cron jobs. This section describes the different tools available and how to use them.

6.1.1                Cleaning up old Globus files and jim sandboxes

 

Product

samgrid_util

Execute as

samgrid on the head node or forwarding node once a day

Execute operation

$ setup samgrid_util

$SAMGRID_UTIL_DIR/cron/samgrid_ce_disk_cleanup.sh --gramDir=<Dir containing gram_job_mgr_*log and gram_scratch_*> --gassDir=<Globus gass cache dir> --sandboxDir=<jim_sandbox dir>

This script cleans up the old job files left behind by Globus in the home area of user samgrid. It can also be used to clean up old sandboxes from the jim_sandbox area.

The cleanup policy is:

·        gram_scratch_* : 10 days old

·        gram_job_mgr_*log : 3 days old

·        jim_sandbox dirs : 30 days old
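The same policy can be sketched with plain find(1); the directory layout below is an assumption for illustration, and the supported tool remains samgrid_ce_disk_cleanup.sh:

```shell
#!/bin/sh
# Sketch of the cleanup policy above, demonstrated on a throwaway layout.
cleanup() {
    gram_dir=$1; sandbox_dir=$2
    # gram_scratch_* directories older than 10 days
    find "$gram_dir" -maxdepth 1 -name 'gram_scratch_*' -mtime +10 -exec rm -rf {} +
    # gram_job_mgr_*log files older than 3 days
    find "$gram_dir" -maxdepth 1 -name 'gram_job_mgr_*log' -mtime +3 -exec rm -f {} +
    # sandbox directories older than 30 days
    find "$sandbox_dir" -mindepth 1 -maxdepth 1 -mtime +30 -exec rm -rf {} +
}

# Demonstration on temporary directories:
demo=$(mktemp -d)
mkdir -p "$demo/home/gram_scratch_old" "$demo/sandbox/job_old"
touch -d '40 days ago' "$demo/home/gram_scratch_old" "$demo/sandbox/job_old"
touch "$demo/home/gram_job_mgr_fresh.log"
cleanup "$demo/home" "$demo/sandbox"
```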

 

6.1.2                Cleaning up CondorG queue for OSG jobs

 

Product

samgrid_util

Execute as

samgrid on the forwarding node once a day

Execute operation

$ setup samgrid_util

$SAMGRID_UTIL_DIR/cron/mark_osg_jobs_for_deletion.sh --hold-days=<Number of days. Jobs older than this number will be marked for deletion. Min value 5>

This script cleans up the jobs that are in held state in CondorG on the forwarding node. This helps keep the response from CondorG on the forwarding node fast over a period of time by keeping the queue length from growing too big with old jobs.

This script only works on the newer forwarding nodes (samgfwd0x) and mandates that jobs that are less than 5 days old will not be cleaned up.

Note: If the forwarding node has been in production for a while without this script in place, you should run the script manually the first few times before putting it in crontab. After a few days of production, the number of held jobs can quickly grow beyond 100,000. Cleaning up so many jobs is an intensive process and is best handled incrementally; otherwise you will notice that CondorG on the forwarding node becomes unresponsive. It is highly recommended to start with a very large number for --hold-days and gradually reduce it by a week (7 days) or 10 days, based on how many jobs are in the held state. It is best to keep the number of jobs cleaned at a time to about 10,000 and adjust the number of days accordingly.

To see the jobs that are in the held state:

$ setup samgrid_osg_client

$ condor_q -constraint 'JobStatus == 5'

6.1.3                Cleaning up CondorG queue for Samgrid jobs

 

Product

samgrid_util

Execute as

sam on the samgrid.fnal.gov node once a week

Execute operation

$ setup samgrid_util

$SAMGRID_UTIL_DIR/cron/mark_samgrid_jobs_for_deletion.sh --max-days=<Number of days. Jobs older than this number will be marked for deletion. Min value 60> --debug=<true|false Case sensitive. Behavior defaults to true>

This script cleans up the jobs that are in CondorG on the queuing node, samgrid.fnal.gov. This helps keep the response from CondorG on the queuing node fast over a period of time by keeping the queue length from growing too big with old jobs.

Note: If the queuing node has been in production for more than 7 months without this script in place, you should run the script manually the first few times before putting it in crontab. After a few days of production, the number of jobs can quickly increase. Cleaning up so many jobs is an intensive process and is best handled incrementally; otherwise you will notice that CondorG on the queuing node becomes unresponsive. It is highly recommended to start with a very large number for --max-days and gradually reduce it by a week (7 days).

6.1.4                Rotate log files daily and archive them Monthly

 

Product

samgrid_util

Execute as

sam on the forwarding node once a day at 00:05 am

Execute operation

$ setup samgrid_util; setup vdt; setup jim_advertise; setup tomcat;  $SAMGRID_UTIL_DIR/cron/samgrid_rotate_logs.sh --logrotate-workdir=/samgrid/logs/jimlogs/samgrid_log_rotate --globus-log-dir=$GLOBUS_LOCATION/var --jimadvertise-log-dir=/samgrid/logs/jimlogs/jim_advertise --tomcat-log-dir=$TOMCAT_DIR/logs

It is strongly recommended to run this script at 00:05 am via cron. This script rotates the samgrid jobmanager logs, the globus gatekeeper and accounting logs, the globus gridftp logs and the jim_advertise logs, and stores the old log files in a specific naming format. This naming format is required and used by the log archiving tool to archive old logs.
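For example, the recommended schedule could look like this in the sam user's crontab (the setups.sh path is an assumption; the remaining paths come from the example operation above):

```shell
# Hypothetical crontab entry for user sam, daily at 00:05.
5 0 * * * . /usr/local/etc/setups.sh && setup samgrid_util && setup vdt && setup jim_advertise && setup tomcat && $SAMGRID_UTIL_DIR/cron/samgrid_rotate_logs.sh --logrotate-workdir=/samgrid/logs/jimlogs/samgrid_log_rotate --globus-log-dir=$GLOBUS_LOCATION/var --jimadvertise-log-dir=/samgrid/logs/jimlogs/jim_advertise --tomcat-log-dir=$TOMCAT_DIR/logs > /dev/null 2>&1
```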

 

Product

samgrid_util

Execute as

sam on the forwarding node once every month

Execute operation

$ setup tomcat; setup vdt; setup jim_advertise; $SAMGRID_UTIL_DIR/cron/archive_samgrid_logs.sh --archive-dir=/samgrid/logs/logsarchives --tomcat-log-dir=$TOMCAT_DIR/logs --globus-log-dir=$GLOBUS_LOCATION/var --jimadvertise-log-dir=/samgrid/logs/jimlogs/jim_advertise

This tool archives logs that are older than the current month, zips them and stores them in the directory specified by --archive-dir. This tool expects the names of the old log files to follow the specific convention achieved by running samgrid_rotate_logs.sh above.

6.1.5                Relocate condor job spool directories for jim_broker_client

 

Product

samgrid_util

Execute as

sam based on the needs

Execute operation

$ setup samgrid_util

$SAMGRID_UTIL_DIR/cron/archive-jim_broker_client-spool.sh --number-of-jobs=250

In this case, the oldest 250 job spool areas will be moved from the current location /samgrid/logs/jimlogs/jim_broker_client/spool to /samgrid/logs/jimlogs/jim_broker_client/spool/archive/spool.1/. To avoid human error, the maximum number of jobs is capped at 250. If you need to move more directories at a time, you can run the tool again.

A better solution would be to have two queueing nodes, one for Monte Carlo and the other for reconstruction, to make the infrastructure more scalable.

 

6.2           Automate security setup tasks

6.2.1          Generate gridmapfile for jim_broker_client from the DZero member list in voms

 

Product

sam_gsi_config, jim_broker_client

Execute as

sam on the queuing node one to two times a day

Execute operation

$ setup sam_gsi_config -q vdt; setup jim_broker_client; generate_submission_site_gridmap

 

Requires sam_gsi_config v2_3_5 -q vdt or higher

 

6.2.2          Automatically fetch the latest CA certificate files and update samgrid ca files.

 

Product

vdt, sam_gsi_config

Execute as

root for the vdt-related commands (affects the whole vdt installation);
sam for the sam_gsi_config-related commands

Execute operation

Root related commands:

root$ setup vdt

root$ vdt-control --on vdt-update-certs

root$ vdt-control --on fetch-crl

 

sam_gsi_config related commands:

Put a crontab entry in place to run the following commands daily as user sam:

setup sam_gsi_config -q vdt; sam_gsi_install_ca --force-copy
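A possible crontab entry implementing this (the setups.sh path and the time of day are assumptions):

```shell
# Hypothetical crontab entry for user sam, daily at 06:00.
0 6 * * * . /usr/local/etc/setups.sh && setup sam_gsi_config -q vdt && sam_gsi_install_ca --force-copy > /dev/null 2>&1
```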

 

Requires sam_gsi_config v2_3_5 -q vdt or higher

 

 

7            Quick-Start

7.1           Job Submission

Product

jim_client

Execute as

Anyone with proper credentials.

Execute operation

samg submit <myjob.jdf> [condor_submit args]

Notes

[condor_submit args] are the arguments passed directly to the condor_submit command, e.g. “-r schedd.machine.fqdn”.

7.1.1                A typical SAM Analysis Job submission

 

The steps to submit a job are

1.      The user creates a job description file (MyJob.jdf) in his/her writable working directory:

A typical SAM analysis job is given below.

------------------------------ ------------------------------

sam_dataset = ab2files

station_name = sammy

executable = /home/murthi/testbed/samanalysis/retrieve.sh

job_manager = sam

job_type = sam_analysis

sam_universe = dev

sam_experiment = d0

output = /tmp/murthi/hello_sammy.output

error = /tmp/murthi/hello_sammy.error

cpu-per-event = 1s

group = grid

instances = 1

Globusscheduler = $$(gatekeeper_url_)

------------------------------ ------------------------------


 

2.      By entering 'samg submit MyJob.jdf' the job is submitted to the SAM-Grid for execution.

 

 

$samg submit samanalysis.samgjdf_sammy

 

Checking Grid credentials...

Ok.

 

Job(s) submitted successfully.

 

Global JID = murthi_samadams.fnal.gov_130922_17513

 

You will get the Global Job ID, which can be used as a reference when monitoring the job. For each instance of a job, output and error files are generated. In addition, you can specify a log file that will keep track of the job during submission. These files are especially important for troubleshooting. Optionally, you can also submit your job to a specific scheduler listed on your collector by invoking the command below.

 

samg submit <samg_jdf> -f schedd.machine.fqdn

 

3.      You may optionally check the job's status on the queue with

$ samg list jobs murthi_samadams.fnal.gov_130922_17513_0

 

The job can be referenced from the monitoring site by its Global Job ID.

7.1.2                Job Description File

 

In order to submit a job to SAM-Grid, you need to create a job description file (jdf). The jdf can contain a number of attributes from which some are required.

 

The syntax required for the jdf is case-sensitive. The order of attributes is not significant. However, when running job instances, the "instances" attribute should be located after its attributes have been defined. In the case of multiple instances, only the attributes that change should be written again in the jdf. An example is shown above.

7.1.2.1          Attributes

They are grouped according to the way 'samg submit' handles them. It is possible to have different types of jobs. "sam_analysis" jobs are brokered to a SAM-Grid resource. The job type "caf" uses caf resources. There is also some ongoing work to support "monte_carlo" jobs.

The required attributes for sam_analysis jobs are "sam_dataset", "sam_universe", "sam_experiment", "executable", "cpu-per-event", and "instances".

8            FAQ

More information on troubleshooting can be found at http://www-d0.fnal.gov/computing/grid/JIM-FAQ.htm.

9            Appendix A: The SAMGrid JDL

 

These are the specifications for the SAM-Grid job description language.

 

The job description language distinguishes the different job types (sam_analysis, vanilla, caf, mc_runjob, etc.). In the specifications presented below, a section is provided for each type. In addition, some extensions that are not yet fully productized are listed. These extensions are therefore not recommended for use.

 

Note for advanced users

“samg submit” converts JDL of a supported job type to Condor JDL, with some exceptions.

·        Attributes prefixed with “+” override the auto-generation of these attributes by “samg submit”. These attributes are printed in the ClassAds of the generated Condor JDL with no “+” prefix.

·        Attributes prefixed with “++” do not override the auto-generation of attributes; instead they are printed in the generated Condor JDL with a single “+” prefix.
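As an illustration of the two prefixes (the attribute names and values below are hypothetical):

```
# SAM-Grid JDL fragment
+globusscheduler = my.gatekeeper.example:2119/jobmanager-sam
++MyCustomAttr = "some value"

# Sketch of the corresponding generated Condor JDL
globusscheduler = my.gatekeeper.example:2119/jobmanager-sam
+MyCustomAttr = "some value"
```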

9.1.1                Common JDL Specifications

 

The specifications listed below apply for all types of jobs.

9.1.1.1          Required attributes

 

job_type = <keyword>

job_type refers to a unique keyword that denotes a specific job type. Valid keywords are

montecarlo, mc_runjob (deprecated), merge, structured, samanalysis and caf.

 

instances = 1

Currently multiple instances of jobs are not supported.

9.1.1.2          Optional attributes

Some of the attributes listed below may be required for certain job types, or may need a particular value if declared. Refer to the job-specific JDL specifications for the final word on how each attribute should be used.

 

input_sandbox = <directory>

Specifies a directory that serves as the input sandbox.  This directory will be shipped to the execution site. The sandbox must contain the executable as well as other files needed for the job.

 

input_sandbox_tgz = <pathname to a tar.gz file>

Specifies a “tar.gz” file that serves as the input sandbox. This bundle (input_sandbox_tgz) will be shipped to the execution site. The sandbox must contain the executable as well as the other files needed for the job.

 

log = <pathname>

Log for grid-specific information, especially useful for debugging. Note: this is not the user job's log.

 

input = <pathname>

Any standard input the job needs while running. The pathname refers to a local file name.

 

output = <pathname>

error = <pathname>

These refer to the local files to which the job’s standard output and error will be shipped back from the execution site. This feature is not completely functional: the output and error do not reach the client’s side, but they can be extracted from the submission site.

 

jobmanager_name = <keyword>

jobmanager_name refers to the job manager used at the execution site, e.g. sam.

 

Globusscheduler = <scheduler-name>

Specifies the Globus resource to which the job should be submitted.  The default is the matched resource.

 

station_name = <stationname>

The station name at which the job will be executed, assuming that the requirements are satisfied. If the user does not define the station name, brokering will determine it from the matching station. However, the station name may be declared if the user prefers a certain station.

 

requirements = <Boolean expression>

The expression must evaluate to true on the matching machine.  The requirements specified by the user get appended to the default requirements generated by the jim_client.

 

arguments = <executable_args>

Parameters to be passed to the executable. The parameters must NOT be enclosed in double quotes (e.g. arguments = arg1 arg2 arg3).

 

grid_resource_requirements_string = <Resource Contact | Constraints expressed in GlueSchemaFormat>

 

Example:

Submitting a job to a specific resource:

- grid_resource_requirements_string = cmsosgce.fnal.gov:2119/jobmanager-condor

- grid_resource_requirements_string = (TARGET.GlueCEInfoHostName =?= "stitch.oscer.ou.edu")

 

Submitting a job using OSG ReSS to a resource matching constraints:

- grid_resource_requirements_string = (stringlistimember("VO:dzero", TARGET.GlueCEAccessControlBaseRule, ",") && stringlistimember("OSG-0.4.1", TARGET.GlueHostApplicationSoftwareRunTimeEnvironment, ","))

- grid_resource_requirements_string = (GlueCEInfoContactString == "red.unl.edu:2119/jobmanager-pbs") || (GlueCEInfoContactString == "cmsosgce.fnal.gov:2119/jobmanager-condor") || (GlueCEInfoContactString == "grid1.oscer.ou.edu:2119/jobmanager-lsf") || (GlueCEInfoContactString == "osg-gw-2.t2.ucsd.edu:2119/jobmanager-condor")
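Putting the common attributes together, a minimal jdf skeleton might look like the following sketch (all pathnames are illustrative):

```
job_type = samanalysis
instances = 1
input_sandbox = /home/user/myjob
log = /home/user/myjob.log
output = /home/user/myjob.out
error = /home/user/myjob.err
```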

9.1.2                SAM Analysis JDL Specifications

An example can be found at <jim_client_product_dir>/demo_examples/release/samanalysis.samgjdf.

9.1.2.1          Required attributes

cpu-per-event = <value>

The estimated CPU time used per event, expressed in s|m|h (seconds|minutes|hours).

 

sam_dataset  = <definition_name>

Name of the dataset definition to be used in the job. The dataset definition must be predefined.

 

sam_universe = <dev | prd>

Specifies the universe for the job. This is required to match with the resource.

 

sam_experiment = <D0 | CDF>

Specifies the experiment for the job. This is also required to match with the resource.

9.1.2.2          Optional attributes

group = <groupname>

The group name to which the job belongs.

 

extra_sam_submit_args = <sam submit arguments>

These are arguments of the form (--name1=value1 --name2=value2). They are appended to the generated default arguments for “sam submit”. Quotes are not generally recommended.
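Combining the attributes above, a SAM analysis jdf might look like the following sketch (the dataset definition, executable, and group names are illustrative; the distributed example at <jim_client_product_dir>/demo_examples/release/samanalysis.samgjdf is the authoritative reference):

```
job_type = samanalysis
instances = 1
sam_dataset = my_dataset_definition
sam_universe = prd
sam_experiment = D0
executable = myanalysis.sh
cpu-per-event = 2s
input_sandbox = /home/user/myjob
group = test
```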

9.1.3                CAF JDL Specifications

An example can be found at <jim_client_product_dir>/demo_examples/release/caf.samgjdf

9.1.3.1           Required Attributes

input_sandbox_tgz = <tgz_file>

The user directory, already targz'ed. Note that the '_tgz' in the name is needed to distinguish it from the sam_analysis-type attribute 'input_sandbox', which is a local dir to be tgz'ed.

 

caf_initial_section = <integer value>

The number of the initial section.

 

sam_dataset = <definition_name>

Name of the dataset definition to be used in the job.

9.1.3.2          Optional attributes

caf_user_name = <user_name>

The default is <user_name> on the client host.

 

email = <email_address>

The user email address. The default is <user_name>@<hostname> on the client host.

 

output_sandbox = <output_location>

The location to which the output of the job is sent. The default is <user_name>@<hostname>:~<user_name>/<sam_gid>.tgz, where sam_gid is the global job id assigned by SAM-Grid to the job.

 

caf_job_type = <caf_job_type>

The default is sam.

 

caf_final_section = <integer value>

The number of the final section. The default is caf_initial_section, i.e. one section only.

 

sam_universe = <dev | prd>

Specifies the universe for the job. This is required to match with the resource. The default is prd.

 

sam_experiment = <d0 | cdf>

Specifies the experiment that the job is dedicated to. This is also required to match with the resource. The default is cdf.

 

extra_caf_submit_args = <caf submit arguments>

These are arguments of the form (-name1=value1 -name2=value2). They are appended to the generated default arguments for “sam submit”. Quotes are not generally recommended.
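Combining the attributes above, a CAF jdf might look like the following sketch (the file name, dataset definition, and email address are illustrative; see <jim_client_product_dir>/demo_examples/release/caf.samgjdf for the distributed example):

```
job_type = caf
instances = 1
input_sandbox_tgz = /home/user/myjob.tar.gz
caf_initial_section = 1
caf_final_section = 10
sam_dataset = my_dataset_definition
sam_universe = prd
sam_experiment = cdf
email = user@fnal.gov
```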

9.1.4                Monte Carlo JDL Specifications

 

NOTE: To store the files generated in the Monte Carlo run back to Fermilab, you need to be a member of the mc99 data group. You can verify this information at http://d0db.fnal.gov/sam_admin/cgi/autoRegister.py

9.1.4.1           Required Attributes

runjob_requestid = <monte carlo request number>

The request number whose details are stored in the request database. For more information, please see http://www-d0.fnal.gov/computing/mcprod/mcc.html

 

runjob_numevts = <Number of events to produce for the Request Id>

The number of events to be produced for the Request Id (runjob_requestid).

 

d0_release_version = <d0 code version>

The version of d0 code that is to be used for producing events for runjob_requestid. The d0 code version should be consistent with the version specified in the jobfiles_dataset (explained below).

 

jobfiles_dataset = <dataset (snapshot) containing the tar balls>

The jobfiles_dataset is the dataset (snapshot) containing the files that are necessary for executing the request. This dataset typically contains, but is not limited to, the d0 code tree (e.g. d0_p14.03.02.tar.gz), magnetic field files (e.g. MagField_v00-01-00.tar.gz) if required, card files (e.g. cardFile_v00-07-00.tar.gz), and the mc_runjob code tree (e.g. mc_runjob_v06-02-02-jim-04.tar.gz).

 

phase_dataset_intervals = <comma separated list of event intervals>

The phase_dataset_intervals attribute specifies the intervals of events you want to process or recover.

Example: phase_dataset_intervals = 1-250,501-1000,1251-2000

 

9.1.4.2          Optional attributes

 

minbias_dataset = <dataset containing minimum bias events to be overlaid>

The files containing minimum bias events that are to be overlaid in the digitization phase are specified in this dataset.

 

phase_dataset = <dataset containing the input for a phase in the Monte Carlo chain>

If the request takes the input for a particular phase (typically it’s the generation phase) from SAM, then the dataset containing the input is specified through this attribute. During submission consistency checks are made to determine if the dataset specified by the phase_dataset attribute matches the dataset specified in the request details.

 

phase_skip_num_events = <number of events to skip from the input of the phase_dataset>

This directive configures mc_runjob to skip the specified number of events before reading the input to the phase. This option is particularly useful for error recovery: if some jobs fail, it allows the user to run the jobs again, reading their expected input event range. The starting event of a job that failed can be computed as <job submission index> * <events_per_file> + <phase_skip_num_events> (the last term is usually 0 for the first submission); for example, with events_per_file = 250 and phase_skip_num_events = 0, the job with submission index 4 starts at event 4 * 250 + 0 = 1000.

 

check_consistency = <Boolean value>

This attribute controls the level of consistency checks that are made during grid job submission. The default behavior is that of true (all checks are made). A value of false causes some checks (e.g. the d0 code version check) to be skipped. Mandatory checks (e.g. whether the input is from SAM) are still done.

 

events_per_file = <number of events per output file>

This attribute states the number of events that are to be produced per output file (for each phase). For example, with events_per_file = 250, a grid job of 25,000 events will generate 100 files (for each Monte Carlo phase) containing 250 events each. If unspecified, the number of events per output file depends on the execution site at which the grid job executes.
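Combining the attributes above, a Monte Carlo jdf might look like the following sketch (the request number and dataset names are illustrative):

```
job_type = montecarlo
instances = 1
runjob_requestid = 12345
runjob_numevts = 25000
d0_release_version = p14.03.02
jobfiles_dataset = my_jobfiles_dataset
events_per_file = 250
```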

 

9.1.5                Merge Job JDL Specifications

9.1.5.1           Required Attributes

 

d0_release_version = <d0 code version>

The version of d0 code that is to be used for merging files (typically thumbnails). The d0 code version should be consistent with the version specified in the jobfiles_dataset (explained below).

 

jobfiles_dataset = <dataset (snapshot) containing the tar balls>

The jobfiles_dataset is the dataset (snapshot) containing the files that are necessary for executing the merging job. This dataset typically contains, but is not limited to, the d0 code tree (e.g. d0_p14.03.02.tar.gz) and the mc_runjob code tree (e.g. mc_runjob_v06-02-02-jim-04.tar.gz). Other optional files, such as card files and magnetic field files, are not required to execute merging jobs; if they are present in the dataset, they will not affect the outcome of merging jobs.

 

9.1.5.2           Mutually Exclusive attributes

The following attributes are mutually exclusive but at least one of them has to be present to submit a merge job.

 

merge_dataset_name=<dataset (snapshot) containing the files to be merged>

The dataset contains the files to be merged (typically thumbnails) and is mutually exclusive with merge_dimension_query.

 

merge_dimension_query=<dimension query specifying the constraints for identifying files to be merged>

This is the standard dimension query accepted by SAM. Do not specify the query in double quotes.

 

9.1.5.3          Optional attributes

 

check_consistency = <Boolean value>

This attribute controls the level of consistency checks that are made during grid job submission. The default behavior is that of true (all checks are made). A value of false causes some checks (e.g. the d0 code version check) to be skipped. Mandatory checks (e.g. whether the input is from SAM) are still performed.
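Combining the attributes above, a merge jdf might look like the following sketch (the dataset names are illustrative; remember that merge_dataset_name and merge_dimension_query are mutually exclusive, so only one of them appears):

```
job_type = merge
instances = 1
d0_release_version = p14.03.02
jobfiles_dataset = my_jobfiles_dataset
merge_dataset_name = my_thumbnails_dataset
```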

 

9.1.6                Structured Job JDL Specifications

9.1.6.1          Required Attributes

 

job_structure=<job type1, job type2, … job type n>

This attribute specifies which valid samg job types are to be executed and in what order. For example, you can combine a montecarlo job with a merge job as follows:

job_structure = montecarlo, merge

Please note that a montecarlo-type job is the parent of a merge-type job, i.e. after the montecarlo job is executed, only the results of that particular grid job are operated upon by the merge-type job.
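As a sketch, a structured jdf chaining a montecarlo job with a merge job might look as follows (the request number and dataset names are illustrative, and it is assumed that the required attributes of the component job types are supplied in the same file):

```
job_type = structured
instances = 1
job_structure = montecarlo, merge
runjob_requestid = 12345
runjob_numevts = 25000
d0_release_version = p14.03.02
jobfiles_dataset = my_jobfiles_dataset
```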

 

10     Suggestions

Please send your suggestions or comments to Parag Mhashilkar and Gabriele Garzoglio.

 

 

 

Last updated on Wednesday, October 14, 2009.



[1] The Resource Selector can in principle be distributed; for the deployment of JIM V1, though, it will be central.

[2] JIM is currently interfaced with the D0MC, CAF and SAM-Submit frameworks.