McFarm Planning

 

If you are considering implementing a DØ Monte Carlo farm using UTA’s McFarm software, a few basic decisions determine the configuration and resources you will need.  Once these decisions have been made, and the basic Linux installation and configuration have been completed on some of the nodes, you can install the McFarm software package and the DØ binaries.  The UTA team will be happy to assist you in making these decisions; just give us a basic inventory of your resources and your farming goals (production and/or local consumption).

 

Each CPU that is to run the DØ binaries (on a production node) should have the following as minimum requirements:

  1. OS: RH 7.x (RH 6.2 will work at this time, but DØ may drop support in the future).
  2. Memory: Minimum 96MB/CPU.  Less memory will significantly impede the performance of the farm and could even cause crashes.
  3. Disk space: Minimum 2GB/CPU of local disk space for farm work, in a partition that is separate from the root partition.  Less disk space will cramp production, since new output files have to be stored at least temporarily on local disks until they can be moved to caches and/or stored in SAM, and you want to be able to start new jobs before such dispositions occur.  More disk space will give you more production time in the long run.  If possible, name this partition /scratch to facilitate its NFS export to other nodes.
  4. For multi-CPU nodes, scale the above figures by the number of CPUs.
  5. Node selections: If you have a choice, make your fastest CPUs into production nodes.  These nodes do not require a dedicated monitor if they are used only for farm work.
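
The per-CPU minimums above can be checked against an inventory with a few lines of code.  The following is a minimal sketch, not part of McFarm: the Node record, its field names, and the example figures are hypothetical, and only the 96MB/CPU and 2GB/CPU thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import List

# Per-CPU minimums quoted in the requirements list.
MIN_MEMORY_MB_PER_CPU = 96
MIN_SCRATCH_GB_PER_CPU = 2


@dataclass
class Node:
    """Hypothetical inventory record for one candidate production node."""
    name: str
    cpus: int
    memory_mb: int
    scratch_gb: int  # farm partition, separate from the root partition


def shortfalls(node: Node) -> List[str]:
    """Return the reasons a node misses the minimums (empty list if none)."""
    problems = []
    need_mem = MIN_MEMORY_MB_PER_CPU * node.cpus      # scale by CPU count
    need_disk = MIN_SCRATCH_GB_PER_CPU * node.cpus
    if node.memory_mb < need_mem:
        problems.append(f"memory {node.memory_mb} MB < {need_mem} MB")
    if node.scratch_gb < need_disk:
        problems.append(f"scratch {node.scratch_gb} GB < {need_disk} GB")
    return problems


# Made-up example inventory: one adequate node, one undersized node.
for candidate in [Node("hepfm001", 2, 256, 8), Node("hepfm002", 2, 128, 3)]:
    issues = shortfalls(candidate)
    print(candidate.name, "OK" if not issues else "; ".join(issues))
```

A node that returns an empty list meets the stated minimums; it may still benefit from more disk space, as noted in item 3.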

 

Batch queues may be used (if McFarm “knows about” your particular queuing software – we are presently working on integrating LSF and PBS with McFarm).  However, you should not plan to run non-farm tasks simultaneously with farm tasks, since the DØ binaries will generally soak a CPU (99% utilization).  You can efficiently configure the queues to run farm and non-farm batch tasks in alternating mode.

 

One of the nodes must be designated as the job server.  For farms of up to 50 or so nodes, job creation, serving, and monitoring are a part-time task for this node.  It is convenient to also configure this machine as a gather server, which means it will also be asked to forward the output of the production nodes to cache disks and/or to SAM storage.  Thus, you can expect one CPU of the job server node to be mostly dedicated to these two (job and gather) service functions.  This machine must hold the DØ binaries, mc_runjob, and possibly some substantial archive information, so 4GB is a minimum for the partition that is to be exported.  The job server does not need a particularly fast CPU, as most of its tasks are I/O intensive.

 

Minbias data will be needed to run DØSIM.  The normal practice is to generate a large set of minbias events each time a new release of DØgstar becomes available, then access this dataset randomly for the life of that release.  A substantial number of events is needed to avoid systematic bias in your minbias data addition.  UTA uses 100,000 minbias events, requiring about 50GB of storage and spread over 50 separate files for safe simultaneous access.  The more production nodes you have accessing this data, the more important it becomes to spread it out over multiple file servers so that the server NIC port does not become a bottleneck.  UTA has used 3 file servers to service 50 CPUs and observed about 80% or better utilization of the production CPUs during DØSIM.  File servers should not do production tasks, so use your slowest CPUs for file serving.  Configure each file-serving partition separately so it can easily be exported to other nodes.  Determining how many file servers you need involves a tradeoff: the more nodes you dedicate to file serving, the better performance you will see from the production nodes when they run DØSIM tasks (which account for about 20% of the entire production time), but the fewer nodes you will have running production.
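
As a rough aid to sizing, the UTA figures quoted above work out as follows.  This is only a sketch of the arithmetic: the round-robin placement of files on servers is an assumption made for illustration, not a description of how UTA or McFarm distributes the minbias files.

```python
# Figures quoted in the text for UTA's minbias sample.
TOTAL_EVENTS = 100_000
TOTAL_STORAGE_GB = 50
NUM_FILES = 50
FILE_SERVERS = 3
PRODUCTION_CPUS = 50

events_per_file = TOTAL_EVENTS // NUM_FILES                  # 2000 events
gb_per_file = TOTAL_STORAGE_GB / NUM_FILES                   # 1.0 GB
kb_per_event = TOTAL_STORAGE_GB * 1_000_000 / TOTAL_EVENTS   # 500.0 kB (decimal units)
cpus_per_server = PRODUCTION_CPUS / FILE_SERVERS             # about 16.7 CPUs


def files_on_server(server_index, num_files=NUM_FILES, servers=FILE_SERVERS):
    """Assumed round-robin placement: file i lives on server i % servers."""
    return [i for i in range(num_files) if i % servers == server_index]


for s in range(FILE_SERVERS):
    print(f"server {s}: {len(files_on_server(s))} files")
```

With 50 files on 3 servers, round-robin placement puts 17, 17, and 16 files on the servers, or roughly 16 to 17GB of minbias data each.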

 

If you are planning to cache the farm output for your own consumption (e.g., root tuples), then you may use the file servers for this purpose as well by increasing the amount of disk space on them for additional storage.  You can implement multiple partitions on these nodes, and McFarm will see them.

 

A third use of file servers is to archive your completed jobs (excluding output).  McFarm can use these archives for useful production statistics, or you may need them to investigate completed job parameters.  Archives can be kept for a certain number of days (e.g., 90) or indefinitely (up to the life of that release, for example).  Allow a few GB for job archives if possible.  The archives can be kept on the job server or on one or more file servers.
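
To illustrate what a fixed retention period means in practice, the sketch below selects the archives that have aged past a 90-day window.  It is purely illustrative: the archive names and the mapping from name to completion time are hypothetical, and McFarm's own archive handling may work differently.

```python
from datetime import datetime, timedelta


def expired_archives(archives, now, retention_days=90):
    """Return (sorted) names of archives finished more than retention_days ago.

    archives: dict mapping a hypothetical archive name to its completion time.
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, finished in archives.items() if finished < cutoff)


# Made-up example data.
example = {
    "job_0001": datetime(2002, 1, 15),
    "job_0002": datetime(2002, 5, 20),
}
print(expired_archives(example, now=datetime(2002, 6, 1)))
```

Passing a very large retention_days approximates the "keep indefinitely" policy.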

 

Given the need for many production nodes to see the file servers efficiently, it is desirable to implement the farm on a private, dedicated network switch if at all possible.  A switch allows two different production nodes to communicate with two different file servers simultaneously (as well as with gather servers, etc.); without one, file serving may become a bottleneck for DØSIM.  The farm will run without a switch, but you will likely experience lower efficiencies on DØSIM.

 

If you have control over the network names of all the farm nodes, use a naming scheme such as hepfmNNN where NNN is a number from 000 to 999.  If you have to use nodes that are already named, then a configuration file will identify their network name to McFarm.
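
The hepfmNNN scheme is easy to generate and validate programmatically.  A minimal sketch follows; the helper names node_name and node_index are made up for this example.

```python
import re

# hepfm followed by exactly three digits, e.g. hepfm007.
NODE_NAME = re.compile(r"hepfm(\d{3})")


def node_name(index):
    """Build the network name for node number index (0 to 999)."""
    if not 0 <= index <= 999:
        raise ValueError("node index must be between 0 and 999")
    return f"hepfm{index:03d}"


def node_index(name):
    """Return the node number for a hepfmNNN name, or None for any other name."""
    match = NODE_NAME.fullmatch(name)
    return int(match.group(1)) if match else None


print(node_name(42))           # hepfm042
print(node_index("hepfm999"))  # 999
```

Names that do not follow the scheme yield None from node_index; such nodes would be the ones listed in the configuration file mentioned above.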

 

If you have control over the IP addresses, fixed IPs are more efficient, since you can construct a hosts file and avoid the use of name servers during farm operation, making the farm more self-contained.  The farm will also run with dynamically assigned IPs.
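
With fixed, consecutive addresses, the hosts file entries can be generated rather than typed by hand.  The sketch below assumes the hepfmNNN naming scheme and a made-up private address range (192.168.1.x); substitute your own addresses and domain.

```python
import ipaddress


def hosts_lines(first_ip, count, domain=None):
    """Return hosts-file lines for nodes hepfm000, hepfm001, ... on consecutive IPs."""
    start = ipaddress.IPv4Address(first_ip)
    lines = []
    for i in range(count):
        short = f"hepfm{i:03d}"
        # List the fully qualified name first when a domain is supplied.
        names = f"{short}.{domain} {short}" if domain else short
        lines.append(f"{start + i}\t{names}")
    return lines


# Example with an assumed starting address of 192.168.1.10.
print("\n".join(hosts_lines("192.168.1.10", 3)))
```

The resulting lines can be appended to the hosts file on every farm node so that name lookups do not depend on an external name server.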

 

·         Farm operation account (mcfarm)

You will need a Linux account for farm operation.  The account name mcfarm is recommended.

 

As described in the McFarm implementation documents, you will have to implement NFS (or equivalent) throughout the farm, and this appears to also require the farm to be a NIS/YP domain.  NFS is used to (a) provide the DØ binaries to the production nodes from the job server, (b) give the production nodes running DØSIM access to the minbias data, and (c) support certain McFarm functions such as gathering and monitoring.  Each farm node should have one or more partitions that are dedicated to farming, to facilitate the export of farm data to other farm nodes.

 

·         SSH

SSH (or at least RSH) needs to be implemented on all the nodes in the farm for McFarm operation.
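
Before installing McFarm it can be useful to confirm that the farm account can reach every node without a password prompt.  The following sketch assumes an OpenSSH client (the BatchMode and ConnectTimeout options are OpenSSH options) and the recommended mcfarm account; it is a connectivity check only and is not part of McFarm.

```python
import subprocess


def ssh_command(host, user="mcfarm", timeout_s=5):
    """Build an ssh command that runs `true` remotely and never prompts for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                  # fail instead of asking for a password
        "-o", f"ConnectTimeout={timeout_s}",    # give up quickly on unreachable nodes
        f"{user}@{host}",
        "true",
    ]


def can_login(host, **kwargs):
    """Return True if the non-interactive login to host succeeds."""
    result = subprocess.run(
        ssh_command(host, **kwargs),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example use (not executed here):
#     unreachable = [h for h in ("hepfm000", "hepfm001") if not can_login(h)]
```

Any node reported as unreachable needs its keys or host-based access configured before the farm can dispatch work to it.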

 

·         SAM

Full SAM stations need to be installed to store output in SAM, and/or to acquire files from SAM for processing.  For a farm of over 40 nodes, you may wish to implement SAM on more than one node (an additional gather server) to keep up with the output.  This is another reason for a network switch in the farm.  Once the farm is running, examine your network connection for bottlenecks between the farm and FNAL and address problems as necessary.

 

·         McFarm Bookkeeper and Request Manager

McFarm handles jobs such as pythia and DØgstar, but Monte Carlo production generally starts with a request such as “10,000 top events”.  The McFarm Request Manager is a front-end to McFarm to convert a request into one or more jobs.  The Bookkeeper tracks the completion status of those jobs so you can see the status of the request itself, posting its progress periodically to a web page.
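
The core idea of turning a request into jobs can be sketched as a simple split of the requested event count.  This is an illustration only: the function name and the events-per-job figure are assumptions, and the actual Request Manager may divide requests differently.

```python
def split_request(total_events, events_per_job):
    """Split a request into job sizes; the last job takes any remainder."""
    if total_events <= 0 or events_per_job <= 0:
        raise ValueError("event counts must be positive")
    full_jobs, remainder = divmod(total_events, events_per_job)
    return [events_per_job] * full_jobs + ([remainder] if remainder else [])


# A request for 10,000 events at an assumed 250 events per job.
jobs = split_request(10_000, 250)
print(len(jobs), "jobs, first job has", jobs[0], "events")
```

The request is complete when every job in the list has finished, which is the status the Bookkeeper reports.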

 

You will also be asked to install Globus on a gatekeeper machine.  This is needed if you choose to allow your farm’s progress to be posted to a central web page.  Additional functions that use the gatekeeper are planned as McFarm becomes more grid-enabled.  Therefore, while it is not absolutely necessary to install Globus on a gatekeeper machine, we strongly recommend doing so.

 

A very large farm, say 100 nodes or more, may require a more complicated configuration.  The UTA team will be happy to assist.  Note that there is probably only a small loss of production associated with having two small farms instead of one large farm, since only one additional node is needed for overhead (another job server), and you may gain access to another switch in the process.

 

·         Next

Additional documents will describe how to install base software such as Fermi Linux, SAM, SSH, NFS, NIS/YP, Globus, the DØ binaries, and mc_runjob.  Other documents will show how to create the McFarm job server, production nodes, file servers, and gather servers, and cover the issues for the NFS mounts needed to support the farm.  There is also a complete McFarm operations manual, and a bookkeeper manual.  Links to these documents, a download link, and an FAQ will soon be available on the D0RACE home page (http://www-hep.uta.edu/~d0race).  If you wish to proceed, please contact the UTA team for further assistance.  We would like to know your planned configuration so we can more effectively support it.

 

The team members are:

 

Name                        e-mail                      Phone Number
Anand Balasubramanian       abalasub@cse.uta.edu
Amruth Dattatreya           amruthd@hotmail.com
Karthik Gopalratnam         karthik@hepfm000.uta.edu
Drew Meyer                  ddmeyer@attglobal.net
Mark Sosebee                sosebee@uta.edu             (817)272-2456
Tomasz Wlodek               tomw@utalf7.uta.edu
Jae Yu (Primary Contact)    yu@fnal.gov                 (817)272-2814

 

You can contact any of the team members above for support, but we suggest going through the primary contact so that requests are channeled properly.