SETTING UP AN MCFARM JOB SERVER NODE
This document describes how to set up a Linux node as a generic farm node, that
is, as a job server, a file server and a production node all on the same machine,
once that machine has been set up as directed by the document on Fermi RH 7.1
Installation. This document assumes that the machine is dedicated to farm work.
SETTING UP THE DØ PACKAGES:
Before setting up the DØ packages, create the directories /home/products
and /home/products/fnal, and create a soft link /fnal pointing to the latter
directory. Then proceed with the following steps.
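As a minimal sketch, the directory and link creation described above could be scripted as follows. The PREFIX variable is an assumption introduced here so that the commands can be tried in a scratch location without root; it is not part of the original instructions. Running as root with PREFIX set to the empty string acts on the real paths.

```shell
# PREFIX is a hypothetical staging root (not part of the original instructions).
# Unset: a temporary directory is used. Empty: the real filesystem is used.
PREFIX="${PREFIX-$(mktemp -d)}"

# Create /home/products and /home/products/fnal in one step.
mkdir -p "$PREFIX/home/products/fnal"

# Create the soft link /fnal pointing to /home/products/fnal.
ln -sfn "$PREFIX/home/products/fnal" "$PREFIX/fnal"
```

The -n option keeps ln from descending into an existing /fnal link if the commands are run a second time.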
· Setting up UPS/UPD: Follow the instructions on the d0race web page:
http://www-hep.uta.edu/~d0race/linux_install.html
· Installing the DØ binaries:
o Create directories /home/products/d0dist and /home/products/d0usr, and
create links /d0dist and /d0usr pointing to these directories. Also create
directories /d0dist/dist and /d0usr/products.
o Download the file UPSd0dist.tar.gz into the /d0dist/dist directory, and the
file UPSd0uprod.tar.gz into the /d0usr/products directory. Then un-tar each of
these files in its download directory (tar zxvf UPS*.tar.gz), and run the
.fix-* files in their respective directories. (These scripts will prompt for
user input; answer ‘y’ at each of the prompts.)
o Edit the file /fnal/ups/etc/upsdb_list, and add the lines
/d0usr/products/upsdb
/d0dist/dist/upsdb
o Go to the directory /fnal/ups/db/.updfiles, and rename the updconfig file
which resides there (e.g. mv updconfig updconfig_old). Then download the file
Updconfig into this directory. In this file, there are four sections, each with
the heading “COMMON”. Each of these sections contains a variable “UPS_THIS_DB”.
Make sure that the first two occurrences of this variable are set to
/d0dist/dist/upsdb, the third is set to /d0usr/products/upsdb and the fourth is
set to /fnal/ups/db.
We are now ready to actually install the DØ minitars.
o Create a link /mcc-dist that points to /home/products/d0dist/dist.
o Download the 8 tar files listed on the MCP 10 page from d0mino.fnal.gov.
These minitars reside in the directory /d0dist/dist/minitar/tarfiles. Download
them into the / directory. Then un-tar each of them from the ‘/’ directory,
except for the mc_runjob minitar. Installation of this minitar will be
described later.
This completes the setup of the DØ minitars for farm operation.
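The directory, link and upsdb_list steps above can be sketched as a short script. This is only an illustration: the STAGE variable is a hypothetical staging root added here so the sketch can be tried without root, and the check that skips lines already present in upsdb_list is an addition for safe re-runs, not something the original instructions require. The downloads and the .fix-* scripts are not covered.

```shell
# STAGE is a hypothetical staging root (an assumption, not part of the original text).
# Unset: a temporary directory is used. Empty: the real filesystem is used.
STAGE="${STAGE-$(mktemp -d)}"

# Real directories under /home/products, with short links at the top level.
mkdir -p "$STAGE/home/products/d0dist" "$STAGE/home/products/d0usr"
ln -sfn "$STAGE/home/products/d0dist" "$STAGE/d0dist"
ln -sfn "$STAGE/home/products/d0usr" "$STAGE/d0usr"

# Download directories for the two UPS tar files (created through the links).
mkdir -p "$STAGE/d0dist/dist" "$STAGE/d0usr/products"

# /mcc-dist points at the d0dist download area.
ln -sfn "$STAGE/home/products/d0dist/dist" "$STAGE/mcc-dist"

# Append the two UPS database paths to upsdb_list, skipping lines already present.
# On a real node this file already exists; in a staging root it is created here.
LIST="$STAGE/fnal/ups/etc/upsdb_list"
mkdir -p "$(dirname "$LIST")"
touch "$LIST"
for db in /d0usr/products/upsdb /d0dist/dist/upsdb; do
    grep -qxF "$db" "$LIST" || echo "$db" >> "$LIST"
done
```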
SETTING UP THE SAM STATION:
The job server is also typically one of the SAM gather servers. For this, the
job server must be set up as a SAM station. In order to do this, please follow
the detailed instructions on the webpage for SAM INSTALLATION.
SETTING UP THE MCFARM DIRECTORY STRUCTURE:
Before setting up the directory structure, create the
“mcfarm” account. It is advisable to set up the group for mcfarm and then add
the user account mcfarm by executing the following commands:
groupadd -g 500 mcfarm
useradd -G 500 -g 500 mcfarm
We recommend that you set up the mcfarm account with the
lowest possible group and user IDs as described by the above two commands, but
you may give the group and user higher IDs if you prefer.
- Create a directory /home/scratch, and create a link /scratch
pointing to this directory by executing the command:
ln -s /home/scratch /scratch
- Also make the following directories:
- /scratch/cache_A
- /scratch/error_queue
- /scratch/exec_queue
- /scratch/gath_queue
- /scratch/localbin
- /scratch/logs
- /scratch/root
- /scratch/run
- It is recommended that one or more specific directories serve as the farm
work areas. On the job server, there should be one configured as above. If
you have created more partitions to serve as farm work areas or want to
set apart more directories, then make directories or links like /scratch2,
/scratch3, etc. These can be used for caching or archives, and under
each you can make directories like cache_B, cache_C, etc.
- Create a link /scrJJJ pointing to /scratch, where JJJ is the server node
number.
- Create a link /gatherJJJ pointing to /scratch/gath_queue.
- Make a link /cacheJJJ_A pointing to /home/scratch/cache_A
(JJJ is the server node number, e.g. cache000_A).
- Ownership of all these directories must be mcfarm.mcfarm. Execute the
following commands to effect this:
chown -R mcfarm.mcfarm /home/scratch
chown mcfarm.mcfarm /scratch
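Taken together, the directory structure steps above might be scripted roughly as follows. This is a sketch under stated assumptions: BASE is a hypothetical staging prefix (so the layout can be inspected without root), NODE stands for the JJJ server node number with 000 used only as an example, and the ownership change is guarded so that it runs only as root on a machine where the mcfarm account already exists.

```shell
# BASE is a hypothetical staging prefix, not part of the original instructions.
# Unset: a temporary directory is used. Empty: the real filesystem is used.
BASE="${BASE-$(mktemp -d)}"
NODE="${NODE:-000}"   # JJJ, the server node number (000 is only an example)

# /home/scratch is the real directory; /scratch is a link to it.
mkdir -p "$BASE/home/scratch"
ln -sfn "$BASE/home/scratch" "$BASE/scratch"

# Work directories listed in the instructions.
for d in cache_A error_queue exec_queue gath_queue localbin logs root run; do
    mkdir -p "$BASE/scratch/$d"
done

# Node-numbered links.
ln -sfn "$BASE/scratch" "$BASE/scr$NODE"
ln -sfn "$BASE/scratch/gath_queue" "$BASE/gather$NODE"
ln -sfn "$BASE/home/scratch/cache_A" "$BASE/cache${NODE}_A"

# Ownership change: needs root and an existing mcfarm account.
# The colon form is equivalent to the mcfarm.mcfarm form used in the text.
if [ "$(id -u)" -eq 0 ] && id mcfarm >/dev/null 2>&1; then
    chown -R mcfarm:mcfarm "$BASE/home/scratch"
    chown mcfarm:mcfarm "$BASE/scratch"
fi
```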
SETTING UP THE MCFARM SOFTWARE:
This section describes the actual setting up of the mcfarm
software and configuration. All the following steps must be performed as
user mcfarm unless otherwise specified. (NOTE: In the following
instructions, ‘~’ refers to the home directory of the mcfarm user, i.e. ~/bin
is the same as /home/mcfarm/bin, and so on.)
- In the mcfarm directory, un-tar the mc_runjob tar file downloaded
before.
- Also un-tar the mcfarm minitar that you obtained from the UTA web site in
the mcfarm directory. This will create several directories under the mcfarm
directory.
- Setting up the root manager daemon:
NOTE: We recommend that the McFarm root daemon be installed at
least while you are bringing up the farm initially. After the farm is up and running, there
is less justification for it (it is not critical to normal operation) so
you may wish to disable it for security reasons. If you do keep it active, protect the mcfarm password as
vigorously as you do the root password.
If the root manager daemon is to be allowed, the following steps
have to be carried out. NOTE that the first two steps have to be carried
out regardless of whether the root manager daemon is allowed.
- Modify the file ~/bin/attach_node.template to contain the correct node
number for the job server (e.g. 000) everywhere, and rename this file as
~/bin/attach_node. Also, if there are no secondary caches on the job server,
comment out the lines that deal with cache_B, archive_B, etc. On the other hand,
if there are more than two caches or archives, add lines corresponding to
these. Also modify the domain to be the correct one everywhere (e.g. .uta.edu).
- Make the same changes to ~/bin/detach_node.template and rename it as
~/bin/detach_node.
- Perform the following three steps as root user.
- Copy the following files from ~/bin/ to /scratch/localbin:
- attach_node
- detach_node
- rootman
- rootreq
- Make these files owned by root.root and executable.
- As root user, add the start_rootman script to /root, and then append
the contents of this script to the file /etc/rc.d/rc.local so that
the root manager daemon can start up at boot time.
- Copy the file ssh_status_capture from ~/bin/ to /scratch/localbin,
make it owned by mcfarm.mcfarm, and make it executable. This script is
a workaround for a current problem with ssh under Python on RH 7.1.
- Modify the files ~/bin/backup_code.template and ~/bin/backup_meta.template
to contain the correct server number, and rename these files as you did
before by removing the ‘.template’.
- Modify the ~/bin/setup_farm.template script by setting the appropriate
values for these variables, and rename it as before without the .template:
- FARM_ACCOUNT
- FARM_SRT_SUBDIR
- FARM_NETWORK_NAME
- FARM_SERVER_NODENAME
- FARM_DOMAIN_NAME
- FARM_EXTERNAL_NAME
- FARM_ARCHIVE_JOBS_QUEUE_NN (NN is anything from 00 to 99):
At least one of these is required. Make it point to the full path name of
a new archive directory. If nothing else is suitable, set it to
$FARM_BASE/archive/jobs.
- FARM_FILESERVER_CACHE_NNN_A (NNN = job server node number):
Make this point to $ FARM_CACHE’NNN_A. Also identify the exact name of the
partition on which that directory resides. Repeat these pairs of lines if
there are other cache partitions on the job server.
- FARM_SCRATCH_ACCESS_NNN=
(where NNN is the job server node number). It does not need to have any content.
- FARM_CACHE_ACCESS_NNN=
(where NNN is the job server node number). It does not need to have any content.
- Modify the ~/bin/start_farm.template and ~/bin/stop_farm.template
scripts to contain the correct node number for the server, and save these
as before without the .template.
- Modify the ~/distribute.conf.template file to have a line for the server
(see the example in that file). This line should contain a comma-separated list
of values in the following manner: “node=xxx,max=x,partition=xxx,nodename=xxx.xxx.xxx”.
(To find out the partition, use the df command, and look at the Filesystem
column.) The max=x setting tells mcfarm how many jobs may be
distributed to this machine at one time. This has to be equal to the
number of CPUs on that machine. If it is zero, no jobs will be
distributed to that machine. Once again, rename this file without the .template.
- Modify the file ~/conf_files/uta.basic.template as follows and then rename
this file as xxx.basic, where xxx is the same value as FARM_EXTERNAL_NAME.
This should contain appropriate values for the following variables (the
values can be obtained from Iain Bertram):
- OriginName
- FacilityName
- ProducedForName
- ProducedByName
- GroupName
- The farm periodically sends out status information on running jobs, errored
jobs and gathered jobs to designated individuals through e-mail. Put the
e-mail addresses of the individuals who are to receive this routine mail from
your farm in the ~/notify_routine.template file, and rename this file as
before. To alert farm administrators to any problems encountered during farm
operation, put the e-mail addresses of the farm administrators and of anyone
else who needs to know about these alerts in the ~/notify_alert.template
file. Once again, rename this file as before.
- Modify the .bashrc file in the mcfarm home directory to perform certain
tasks on startup by adding the following lines:
- Source the fnal scripts for UPS / UPD:
source /fnal/ups/etc/setups.sh
- Set up the farm environment variables:
. /home/mcfarm/bin/setup_farm
(Note that this command sources a shell script, i.e. it is
“. SPACE command”.)
- Now reboot the server. This will call the scripts from the .bashrc file
and will also start the root daemons. Then run start_farm to start the
daemons for locking, execution, gathering, distribution and monitoring.
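For the distribute.conf line described in the steps above, the following sketch shows one way to assemble the comma-separated entry. The helper names (partition_of, cpu_count, distribute_line) and the sample values (/dev/hda5, farm000.uta.edu) are illustrative assumptions, not part of McFarm; the example file shipped with McFarm remains the reference for the exact format.

```shell
# partition_of DIR: print the Filesystem column that df reports for DIR.
partition_of() {
    df -P "$1" | awk 'NR==2 {print $1}'
}

# cpu_count: count the processors listed in /proc/cpuinfo (Linux only).
cpu_count() {
    grep -c '^processor' /proc/cpuinfo
}

# distribute_line NODE MAX PARTITION NODENAME: print one distribute.conf entry.
distribute_line() {
    printf 'node=%s,max=%s,partition=%s,nodename=%s\n' "$1" "$2" "$3" "$4"
}

# Example with made-up values for a two-CPU server numbered 000:
distribute_line 000 2 /dev/hda5 farm000.uta.edu
```

On a real server the arguments could be filled in from the machine itself, for instance distribute_line 000 "$(cpu_count)" "$(partition_of /scratch)" followed by the node's full host name.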
RUNNING A TEST JOB:
You now have a farm of one node, which acts as a job server, file server and
production node all rolled into one. Now you can run a test minbi production
job to see if everything was set up correctly.
- Make sure that the ~/distribute.conf file has max=1 for this node
so that jobs are distributed to this node.
- Modify the ~/conf-files/minbi-cdf.script.template file to make sure that it
has the proper values for D-release, Cardfile version and UseMaxOpt.
Rename it as before without the .template at the end.
- Run the ~/conf-files/samples/make-minbi script to run a 1000 event
pythia job. You can monitor its progress by using the command jobstat
-av. The job will be distributed, run, and then finally gathered to
the only cache that you have now. You can check this by doing ls
/cacheJJJ_A to see the output there.
- Change the ~/distribute.conf file back to max=0 for the job server (unless
you plan to allow production jobs to run there also).
NOTE: For running production type jobs, see the note on “How
to Submit jobs to McFarm Control system” after you have completed building your
farm.