McFarm Planning
If you are considering implementing a DØ Monte Carlo farm using UTA’s McFarm software, there are some basic decisions to make in order to determine the configuration and resources you will need. Once these decisions have been made, and the basic Linux installation and configuration have been completed on some of the nodes, you can install the McFarm software package and the DØ binaries. The UTA team will be happy to assist you in making these decisions: just give us a basic inventory of your resources and your farming goals (production and/or local consumption).
Each CPU that is to run the DØ binaries
(on a production node) should have
the following as minimum requirements:
Batch queues may be used if McFarm “knows about” your particular queuing software (we are presently working on integrating LSF and PBS with McFarm). However, you should not plan to run non-farm tasks simultaneously with farm tasks, since the DØ binaries will generally saturate a CPU (99% utilization). Instead, you can configure the queues to run farm and non-farm batch tasks alternately.
One of the nodes must be designated as the job server. For farms of up to 50 or so nodes, job creation, serving, and monitoring are a part-time task for this node. It is convenient to also configure this machine as a gather server, which means it will also be asked to forward the output of the production nodes to cache disks and/or to SAM storage. Thus, you can expect one CPU of the job server node to be mostly dedicated to these two service functions (job and gather). This machine must hold the DØ binaries, mc_runjob, and possibly a substantial amount of archive information, so 4GB is the minimum for the partition that is to be exported. The job server does not need a particularly fast CPU, as most of its tasks are I/O intensive.
Minbias data will be needed to run DØSIM. The normal practice is to generate a large set of minbias events each time a new release of DØgstar is available, then randomly access this dataset for the life of that release. A substantial number of events is needed to avoid systematic bias in your minbias data addition. UTA uses 100,000 minbias events, requiring about 50GB of storage, spread over 50 separate files to allow safe simultaneous access. The more production nodes you have accessing this data, the more important it becomes to spread it over multiple file servers so that the server’s NIC port does not become a bottleneck. UTA has used 3 file servers to service 50 CPUs and observed about 80% or better utilization of the production CPUs during DØSIM. File servers should not do production tasks, so use your slowest CPUs for file serving. Configure each file-serving partition separately, so it can easily be exported to other nodes. Determining how many file servers you need involves a tradeoff: the more nodes you dedicate to file serving, the better performance you will see from the production nodes when they run DØSIM tasks (which account for about 20% of the entire production time), but the fewer nodes you will have running production.
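The UTA figures quoted above imply some simple sizing arithmetic. The following is a minimal Python sketch and not part of McFarm: the function names and the round-robin placement of files on servers are illustrative assumptions, 1GB is taken as 1000MB, and the utilization estimate assumes the 20% DØSIM share and the 99% figure for other tasks can simply be combined as a weighted average.

```python
def minbias_layout(n_events=100_000, total_gb=50, n_files=50, n_servers=3):
    """Per-file sizes and a round-robin file placement (illustrative only)."""
    events_per_file = n_events // n_files
    gb_per_file = total_gb / n_files
    mb_per_event = total_gb * 1000 / n_events
    # placement[s] lists the indices of the minbias files held by server s
    placement = {
        s: [f for f in range(n_files) if f % n_servers == s]
        for s in range(n_servers)
    }
    return events_per_file, gb_per_file, mb_per_event, placement


def overall_cpu_utilization(dosim_share=0.20, dosim_util=0.80, other_util=0.99):
    """Weighted average of CPU utilization over DØSIM and non-DØSIM time."""
    return dosim_share * dosim_util + (1 - dosim_share) * other_util


if __name__ == "__main__":
    events, gb, mb, placement = minbias_layout()
    print(events, "events per file,", gb, "GB per file,", mb, "MB per event")
    print({s: len(files) for s, files in placement.items()})
    print(round(overall_cpu_utilization(), 3))
```

With the default numbers each file holds 2,000 events in about 1GB, and the 80% figure during DØSIM lowers the estimated overall utilization only to about 95%, which is why a small number of file servers can be a reasonable compromise.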
If you are planning to cache the farm output for your own consumption (e.g., root tuples), you may use the file servers for this purpose as well by adding disk space to them. You can implement multiple partitions on these nodes, and McFarm will see them.
A third use of file servers is to archive your completed jobs (excluding output). McFarm can use these archives for useful production statistics, or you may need them to investigate completed job parameters. Archives can be kept for a certain number of days (e.g., 90) or indefinitely (up to the life of that release, for example). Allow a few GB for job archives if possible. They can be kept on the job server or on one or more file servers.
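As an illustration of a fixed retention period, the sketch below lists archive entries older than a given number of days. It is not a McFarm tool: the function name, the idea of one directory entry per archived job, and the use of file modification time as the archive date are all assumptions made for this example. It only reports candidates and deletes nothing.

```python
import time
from pathlib import Path


def expired_archives(archive_root, max_age_days=90, now=None):
    """Return entries directly under archive_root older than max_age_days.

    Age is judged by modification time, which is an assumption of this
    sketch; McFarm's own archive bookkeeping may work differently.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    root = Path(archive_root)
    return sorted(p for p in root.iterdir() if p.stat().st_mtime < cutoff)
```

A call such as `expired_archives("/path/to/archive", max_age_days=90)` (the path is hypothetical) returns the entries you might then review or remove by hand.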
Given the need for many production nodes to see the file servers efficiently, it is desirable to implement the farm on a private, dedicated network switch if at all possible. A switch allows two different production nodes to communicate with two different file servers simultaneously (as well as gather servers, etc.); without one, file serving may become a bottleneck for DØSIM. The farm will run without a switch, but you will likely experience lower efficiency on DØSIM.
If you have control over the network names of all the farm nodes, use a naming scheme such as hepfmNNN, where NNN is a number from 000 to 999. If you have to use nodes that are already named, a configuration file will identify their network names to McFarm.
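The hepfmNNN scheme is easy to generate and check programmatically. This is a small sketch with hypothetical helper names; it is not how McFarm itself parses node names.

```python
import re

# Matches names of the form hepfmNNN with exactly three digits.
NODE_NAME = re.compile(r"^hepfm(\d{3})$")


def node_name(n):
    """Build the node name for node number n (0 to 999)."""
    if not 0 <= n <= 999:
        raise ValueError("node number must be between 0 and 999")
    return f"hepfm{n:03d}"


def node_number(name):
    """Return the node number encoded in name, or None if it does not match."""
    m = NODE_NAME.match(name)
    return int(m.group(1)) if m else None
```

For example, `node_name(7)` gives `hepfm007`, and `node_number("hepfm042")` gives 42.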
If you have control over the IP addresses, fixed IPs are more efficient, since you can construct a host file and avoid the use of name servers during farm operation, making the farm more self-contained. The farm will also run with dynamically assigned IPs.
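With fixed, consecutive addresses, a host file can be generated rather than typed by hand. The sketch below assumes the hepfmNNN naming scheme and a consecutive address block; the function name and the example starting address are illustrative assumptions, not values prescribed by McFarm.

```python
import ipaddress


def hosts_file(first_ip, n_nodes, prefix="hepfm"):
    """Return /etc/hosts-style text for n_nodes consecutive fixed IPs."""
    base = ipaddress.IPv4Address(first_ip)
    lines = ["127.0.0.1\tlocalhost"]
    for i in range(n_nodes):
        # Adding an integer to an IPv4Address yields the next address.
        lines.append(f"{str(base + i)}\t{prefix}{i:03d}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 192.168.1.10 is a placeholder private address for illustration.
    print(hosts_file("192.168.1.10", 3), end="")
```

The same text would be installed on every farm node so that name lookups never leave the farm.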
· Farm operation account (mcfarm)
You will need a Linux account for farm operation. The account name mcfarm is recommended.
As described in the McFarm implementation documents, you will have to implement NFS (or an equivalent) throughout the farm, and this appears to also require the farm to be a NIS/YP domain. NFS is used to (a) provide the DØ binaries to the production nodes from the job server, (b) give production nodes running DØSIM access to the minbias data, and (c) support certain McFarm functions such as gathering and monitoring. Each farm node should have one or more partitions dedicated to farming, to facilitate the export of farm data to other farm nodes.
· SSH
SSH (or at least RSH) needs to be
implemented on all the nodes in the farm for McFarm operation.
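Before starting McFarm it can be useful to confirm that every node accepts a non-interactive SSH login. The sketch below is an assumption-laden example, not part of McFarm: it assumes the OpenSSH client, key-based (passwordless) login for the mcfarm account, and hypothetical helper names. `BatchMode=yes` makes ssh fail instead of prompting for a password, and `ConnectTimeout` bounds the wait for unreachable nodes.

```python
import subprocess


def ssh_check_command(host, user="mcfarm", timeout_s=5):
    """Build an ssh command that runs `true` on host without prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        f"{user}@{host}",
        "true",
    ]


def unreachable_nodes(hosts, run=subprocess.run):
    """Return the hosts on which the non-interactive ssh check fails."""
    failed = []
    for host in hosts:
        result = run(ssh_check_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling `unreachable_nodes(["hepfm000", "hepfm001"])` would return the subset of nodes that still need SSH set up; the `run` parameter exists only so the check can be exercised without a real network.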
· SAM
Full SAM stations need to be installed to store output in SAM and/or to acquire files from SAM for processing. For a farm of over 40 nodes, you may wish to implement SAM on more than one node (an additional gather server) to keep up with the output. This is another reason for a network switch in the farm. Once the farm is running, examine the network connection between the farm and FNAL for bottlenecks and address problems as necessary.
· McFarm Bookkeeper and Request Manager
McFarm handles jobs such as pythia and DØgstar, but Monte Carlo production generally starts with a request such as “10,000 top events”. The McFarm Request Manager is a front end to McFarm that converts a request into one or more jobs. The Bookkeeper tracks the completion status of those jobs so you can see the status of the request itself, and periodically posts its progress to a web page.
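The relationship between a request, its jobs, and the reported progress can be illustrated with a toy model. This is not the Request Manager or Bookkeeper code: the class, the function names, and in particular the 250 events per job are hypothetical choices made only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Job:
    request_id: str
    first_event: int
    n_events: int
    done: bool = False


def split_request(request_id, total_events, events_per_job=250):
    """Split a request into consecutive jobs of at most events_per_job events."""
    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append(Job(request_id, first, n))
        first += n
    return jobs


def request_progress(jobs):
    """Fraction of the requested events that belong to completed jobs."""
    total = sum(j.n_events for j in jobs)
    done = sum(j.n_events for j in jobs if j.done)
    return done / total if total else 0.0
```

Under these assumptions a request for 10,000 events becomes 40 jobs, and once 10 of them are marked done `request_progress` reports 0.25.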
You will also be asked to install Globus on a gatekeeper machine. This is needed if you choose to allow your farm’s progress to be posted to a central web page. Additional functions that use the gatekeeper are planned as McFarm becomes more grid-enabled. Therefore, while it is not absolutely necessary to install Globus on a gatekeeper machine, we strongly recommend doing so to support more grid-enabled McFarm operation.
A very large farm, say 100 nodes or more, may require a more complicated configuration. The UTA team will be happy to assist. Note that there is probably only a small loss of production associated with having two small farms instead of one large farm, since only one additional node is needed for overhead (another job server), but you may gain access to another switch in the process.
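The overhead argument can be checked with simple arithmetic. In this sketch (the function names and the 100-node example are illustrative), each farm gives up one node to act as job server; file servers and gather servers are ignored for simplicity.

```python
def production_nodes(total_nodes, n_farms=1, overhead_per_farm=1):
    """Nodes left for production after each farm reserves its job server."""
    return total_nodes - n_farms * overhead_per_farm


def relative_loss(total_nodes, n_farms=2):
    """Fractional production loss from splitting one farm into n_farms."""
    single = production_nodes(total_nodes, 1)
    split = production_nodes(total_nodes, n_farms)
    return (single - split) / single
```

For 100 nodes, one farm leaves 99 nodes for other work and two farms leave 98, a loss of roughly 1%.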
· Next
Additional documents will describe how to install base software such as Fermi Linux, SAM, SSH, NFS, NIS/YP, Globus, the DØ binaries, and mc_runjob. Other documents will show how to create the McFarm job server, production nodes, file servers, and gather servers, and will cover the issues for the NFS mounts needed to support the farm. There is also a complete McFarm operations manual, and a bookkeeper manual. Links to these documents, a download link, and an FAQ will soon be available on the D0RACE home page (http://www-hep.uta.edu/~d0race). If you wish to proceed, please contact the UTA team for further assistance. We would like to know your planned configuration so we can more effectively support it.
The team members are:
| Name | e-mail | Phone Number |
| Anand Balasubramanian | | |
| Amruth Dattatreya | | |
| Karthik Gopalratnam | | |
| Drew Meyer | | |
| Mark Sosebee | | (817)272-2456 |
| Tomasz Wlodek | | |
| Jae Yu (Primary Contact) | | (817)272-2814 |
You can contact any of the team members above for support, but we suggest going through the primary contact so that requests are channeled properly.