MCFARM PRODUCTION
NODE PREPARATION WITH AUTOMATIC INSTALL SCRIPT
This document describes the preparation of an MC production node.
Prerequisites for setting up a production node (referred to as node number NNN):
Complete the following steps before configuring NNN:
1. On the job server JJJ, make sure that the /etc/exports file contains a line exporting the /home directory to NNN, like this:
/home FARM_NAME_NNN(rw,no_root_squash)
2. On EACH file server FFF, the /etc/exports file should contain a line for each cache disk that needs to be exported to the node NNN:
/scratch FARM_NAME_NNN(rw,no_root_squash)
NOTE: It is VERY important that the above steps are completed correctly before proceeding to the configuration of NNN itself!
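The export checks in steps 1 and 2 can be scripted. The following is a minimal sketch, not part of mcfarm: the function name has_export is hypothetical, and it assumes each exports entry sits on a single line with the directory as the first field.

```shell
# Hypothetical helper (not part of mcfarm).
# Usage: has_export FILE DIR HOST
# Succeeds if FILE has a line whose first field is DIR and which
# lists HOST followed by "(" among its clients.
has_export() {
    file=$1
    dir=$2
    host=$3
    awk -v d="$dir" -v h="$host" '
        $1 == d {
            for (i = 2; i <= NF; i++)
                if (index($i, h "(") == 1) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

For example, on the job server: has_export /etc/exports /home FARM_NAME_NNN && echo "export present".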
Configuring Node NNN as an mcfarm Production Node:
Then source this script in the current shell with the command ". SetupEnv".
If any problem occurs during installation, the script exits and, in general, cleans up after itself so that the system returns to the state it was in before you ran the script. If you want to undo a successful installation for some reason, run "python ProdNodeSetup cleanup". This restores the system to the state it was in before you ran the install script.
4. Modify the file /home/mcfarm/bin/attach_nodes_to_js to include this node as well.
5. On the job-server JJJ and on each gather-server GGG, make the following mount directories and links:
mkdir /mnt/hepfmNNN
mkdir /mnt/hepfmNNN/scratch
mkdir /mnt/hepfmNNN/cache_A
mkdir /mnt/hepfmNNN/gath_queue
ln -s /mnt/hepfmNNN/scratch /scrNNN
ln -s /mnt/hepfmNNN/cache_A /cacheNNN_A
ln -s /mnt/hepfmNNN/gath_queue /gatherNNN
chown mcfarm.mcfarm /mnt/hepfmNNN/*
chown mcfarm.mcfarm /scrNNN
chown mcfarm.mcfarm /cacheNNN_A
chown mcfarm.mcfarm /gatherNNN
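Because the same commands are repeated on the job server and on every gather server, it can help to generate them from the node number. The sketch below only prints the commands so they can be reviewed first. The function name print_server_setup is hypothetical, and it groups the mkdir and chown calls onto fewer lines than the list above.

```shell
# Hypothetical helper (not part of mcfarm).
# Usage: print_server_setup NNN
# Prints, without running them, the directory, link, and ownership
# commands for node number NNN.
print_server_setup() {
    n=$1
    base=/mnt/hepfm$n
    echo "mkdir $base"
    echo "mkdir $base/scratch $base/cache_A $base/gath_queue"
    echo "ln -s $base/scratch /scr$n"
    echo "ln -s $base/cache_A /cache${n}_A"
    echo "ln -s $base/gath_queue /gather$n"
    # The * is inside double quotes, so it is printed literally here
    # and only expanded when the printed command is actually run.
    echo "chown mcfarm.mcfarm $base/* /scr$n /cache${n}_A /gather$n"
}
```

After checking the output, the commands could be executed as root with print_server_setup NNN | sh.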
Then make sure that NNN exports its /scratch directory to JOB_SERVER_JJJ by adding the following line to the /etc/exports file as the root user:
/scratch JOB_SERVER_JJJ(rw,no_root_squash)
Then issue the command
exportfs -ar
to put the new exports file into effect.
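If this step is ever re-run, appending the line blindly would leave duplicate entries in /etc/exports. A minimal sketch of an idempotent append follows; add_export_line is a hypothetical name, and the comparison is an exact whole-line match, so a line that differs only in spacing would still be added.

```shell
# Hypothetical helper (not part of mcfarm).
# Usage: add_export_line FILE LINE
# Appends LINE to FILE unless an identical line is already present.
add_export_line() {
    file=$1
    line=$2
    # -x: match whole lines only; -F: treat LINE as a fixed string, not a regex.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

As root on NNN this would be called as add_export_line /etc/exports '/scratch JOB_SERVER_JJJ(rw,no_root_squash)', followed by exportfs -ar.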
Then, if you did NOT implement the root daemon, you must issue these manual mounts (now, and each time this server is booted):
mount -t nfs -o rw,rsize=16384,wsize=16384,actimeo=0,intr hepfmNNN.uta.edu:/scratch /mnt/hepfmNNN/scratch
mount -t nfs -o rw,rsize=16384,wsize=16384,actimeo=5,intr hepfm009.uta.edu:/cacheNNN_A /mnt/hepfm009/cache_A
mount -t nfs -o rw,rsize=16384,wsize=16384,actimeo=5,intr hepfm009.uta.edu:/scratch/gath_queue /mnt/hepfm009/gath_queue
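Since these mounts must be repeated after every boot, the long option string is easy to mistype. The sketch below builds one mount command from its variable parts. The function name nfs_mount_cmd is hypothetical; the NFS options are copied from the commands above, with actimeo passed as a parameter because it is 0 for /scratch and 5 for the other two mounts.

```shell
# Hypothetical helper (not part of mcfarm).
# Usage: nfs_mount_cmd HOST REMOTE_DIR MOUNT_POINT ACTIMEO
# Prints, without running it, one NFS mount command using the
# option set from this document.
nfs_mount_cmd() {
    host=$1
    remote=$2
    target=$3
    actimeo=$4
    echo "mount -t nfs -o rw,rsize=16384,wsize=16384,actimeo=$actimeo,intr $host:$remote $target"
}
```

For example, nfs_mount_cmd hepfmNNN.uta.edu /scratch /mnt/hepfmNNN/scratch 0 prints the first mount command; the output can be pasted into a root shell or piped to sh.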
If you DID implement the root daemon, then these mounts are performed by running this command as mcfarm on the job server:
root_command $FARM_SERVER_NODENAME --script=$FARM_BIN/attach_nodes_to_js
and, for each gather-server GGG:
root_command hepfmGGG --script=$FARM_BIN/attach_nodes_to_gs_GGG
These commands mount the new node (and, harmlessly, all old nodes) on the job and gather servers.
Either way, from the server you should now be able to do ls /scrNNN and see all the contents of the new node's /scratch directory. The same applies to ls /cacheNNN_A and ls /gatherNNN (test these mounts by placing a file into the target directories).
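The "place a file" test can be scripted as a quick write probe. This is a minimal sketch with a hypothetical function name, check_writable. It only confirms that the directory accepts a new file from this server; it does not confirm that the file is visible on the node itself.

```shell
# Hypothetical helper (not part of mcfarm).
# Usage: check_writable DIR
# Creates a small probe file in DIR, removes it again, and reports
# the result. Returns non-zero if the file could not be created.
check_writable() {
    dir=$1
    probe="$dir/.mount_probe.$$"
    # Braces keep the shell's own redirection error message quiet.
    if { echo probe > "$probe"; } 2>/dev/null && [ -f "$probe" ]; then
        rm -f "$probe"
        echo "OK $dir"
    else
        echo "FAIL $dir"
        return 1
    fi
}
```

Run as mcfarm on the server, for example: for d in /scrNNN /cacheNNN_A /gatherNNN; do check_writable "$d"; done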
6. Modify /home/mcfarm/distribute.conf to include a line as follows for the node NNN:
node=NNN,max=0,partition=hda7,nodename=hepfmNNN.uta.edu
Make sure that NNN is the node number and that the partition is correct. (The partition can be determined by running df /scratch on NNN.)
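The partition lookup and the configuration line can be combined. The sketch below uses two hypothetical helpers: device_basename extracts the device name from df -P output, and conf_line formats the entry shown above. It assumes the partition field is the device name without its /dev/ prefix, as in the hda7 example.

```shell
# Hypothetical helpers (not part of mcfarm).

# Reads "df -P" output on stdin and prints the device name of the
# first filesystem line with any leading directory part removed,
# e.g. /dev/hda7 becomes hda7.
device_basename() {
    awk 'NR == 2 { sub(".*/", "", $1); print $1 }'
}

# Usage: conf_line NNN PARTITION
# Prints a distribute.conf entry with max=0, so no jobs are sent yet.
conf_line() {
    n=$1
    part=$2
    echo "node=${n},max=0,partition=${part},nodename=hepfm${n}.uta.edu"
}
```

On NNN, df -P /scratch | device_basename prints the partition name, which can then be passed to conf_line on the job server.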
7. When you are ready to allow farm tasks to be sent to this new node (you have verified the SSH functions from the job server and all gather servers, and you have verified NFS links from the job server, gather servers, file servers, and the node itself), do these two steps to tell the farm distribute daemon to send jobs to the new node:
Modify the distribute.conf file to set max=M, where M is the number of CPUs on the new node that are to receive farm work. Do not exceed the actual number of CPUs.
If the farm itself is already running, then from the job server, issue
start_execute NNN
If the farm itself has not been started yet, issue start_farm as mcfarm.
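The max=M change can also be made non-interactively. This is a minimal sketch with a hypothetical function name, set_node_max. It assumes each entry begins with node=NNN,max= as in the example line from step 6, and it does not check M against the node's real CPU count, which still has to be verified on the node.

```shell
# Hypothetical helper (not part of mcfarm).
# Usage: set_node_max FILE NNN M
# Rewrites the max= value on the line for node NNN and leaves every
# other line unchanged. Rejects a non-numeric M.
set_node_max() {
    file=$1
    n=$2
    m=$3
    case $m in
        ''|*[!0-9]*) echo "max must be a non-negative integer" >&2; return 1 ;;
    esac
    # Write to a temporary file first so a failed sed does not
    # truncate the original.
    sed "s/^\(node=$n,max=\)[0-9]*/\1$m/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}
```

For example, set_node_max /home/mcfarm/distribute.conf NNN 2. Because mv replaces the file, check its ownership and permissions afterwards.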
NOTE: If you are about to turn this node into a file-server, you probably do not want to allow production jobs to be sent to it, because they will compete for time with the serving of files. In such a case, leave max=0 in the distribute.conf file. If you do send jobs there, watch the subsequent performance of the other nodes when they run d0sim, which uses the file servers for minbi data.