> One of the test jobs submitted to the OSG-NERSC cluster was demanding
> more than 1 GB memory and was eventually kicked out.
> inputdataset: dayset-2005-11-15-all_1-211915-1
> file: all_1_0000211915_100.raw
> Jobfiles dataset: d0repro_jobfiles_p17.05.01_samgridV7-2
>
> > >> I ran through all the dataset you mention below on our farm
> > >> in Lyon and it was without any problem!
> > >> Even the file all_1_0000211915_100.raw was done!
> > >> Max memory used for this file was 705 MB.
> > >> The job was running on WN with SL 3.07 and
> > >> ccwl0414:tcsh[217] uname -r
> > >> 2.4.21-40.ELsmp
> > >> batch system BQS !!!!
> > On Mon, 20 Nov 2006, Michael Diesburg wrote:
> >
> > > There is another possibility here. I would think that
> > > any intelligent batch system would add up the memory usage of
> > > the primary process and any child processes when doing a limit
> > > check. We do have some monitoring and communication processes
> > > running in the background along with D0reco, don't we? Is it
> > > possible one of these has gotten large? Or maybe we just have
> > > enough of them that our effective memory limit for D0reco on
> > > a 1GB system is only ~700MB?
> > > Can someone check what else might be running that could be
> > > attributed to the batch process?
> The batch system in question, which killed the job was Sun Grid Engine
> (SGE). Do we have the feature of setting ulimit in d0runob as well?
>
> On Mon, 2006-11-20 at 09:01 -0600, Reiner Hauser wrote:
> > On CAB and clued0 this happens when you have multiple threads for
> > a single process and their stack sizes are added up (and set to
> > pretty large values by PBS). We take care of that in the
> > standard scripts by a 'ulimit -s 8192' at the very beginning,
> > and mc_runjob has a special option to do the same.
> >
> > So it may be a problem depending on the batch system and how
> > it accounts resources, as you say.
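
For a job wrapper that is not a shell script, the 'ulimit -s 8192' step described above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not the CAB scripts or the mc_runjob option: the function names are hypothetical, and it assumes a POSIX system where Python's resource module is available.

```python
import resource
import subprocess

# Value taken from the thread: 'ulimit -s 8192' (units are kilobytes).
STACK_KB = 8192


def stack_limits(requested_kb, current_hard):
    """Return (soft, hard) in bytes for RLIMIT_STACK.

    The soft limit is the requested size, but never above the hard limit,
    because an unprivileged process cannot exceed its hard limit.
    """
    soft = requested_kb * 1024
    if current_hard != resource.RLIM_INFINITY:
        soft = min(soft, current_hard)
    return soft, current_hard


def _apply_stack_limit():
    # Runs in the child process just before exec, like 'ulimit -s' at the
    # top of a shell script.
    _, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, stack_limits(STACK_KB, hard))


def run_with_stack_limit(cmd):
    """Run cmd (a list of arguments) with the reduced stack limit."""
    return subprocess.run(cmd, preexec_fn=_apply_stack_limit, check=False)


if __name__ == "__main__":
    # On a typical Linux node this prints the stack limit seen by the child.
    run_with_stack_limit(["sh", "-c", "ulimit -s"])
```

Setting the limit in the child only, rather than in the wrapper itself, keeps the wrapper's own limits unchanged.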
On Condor batch systems, the image size that counts is the same as
that measured by ps auxwww (which is different from, and bigger than,
the image size as shown by top or ps -ef). Namely, for a multi-threaded
process it adds all the threads together. To complicate this, there
are some known bugs in Condor that sometimes make it miss some of
the threads.

Steven Timm

There are actually traces of memory usage in the reco.log files.
There are lines similar to the following:

%ERLOG-i mem: rss= 466.096 MB vsize= 568.943 MB

I will make a pass through the p20 log files I have and see if I
can come up with some kind of memory usage distribution.

Mike

P.S. Of course this is only the memory usage of D0reco. It won't
include memory needs of any auxiliary processes.

> On Tue, 2006-11-21 at 09:55 -0600, Michael Diesburg wrote:
>> I hadn't been paying much attention to the p20 memory usage
>> before this issue came up. I have been checking it occasionally now for
>> the last few days. I have seen numerous instances of reco using ~700MB
>> for a resident size and ~950MB virtual size. Certainly not everything
>> is this large. I presume the larger instances are correlated with
>> high luminosity but I haven't actually checked that.
>> At any rate, the fact that I see them means we may be in for
>> some trouble with memory limits.
>> Do we know what limitations are imposed by each of the potential
>> processing sites and how they calculate memory usage? I suspect we need
>> to find out. Keep in mind that there is a *lot* of high luminosity
>> data in the IIb data set.
>>
>
> I will try to find out the memory limits imposed at different OSG sites.
> Is there anything dzero can do to limit the memory usage in the
> releases? It would also be nice to know the expected memory usage for
> sure, i.e. 1GB or 1.25GB etc. This will be helpful in case we have to
> negotiate with the sites to give our jobs some extra affection.
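
The pass over the reco.log memory traces mentioned above could be sketched along these lines. It is a minimal sketch that assumes every trace line has the same layout as the one quoted in this thread; the function names and the 100 MB bin width are hypothetical choices, not part of any D0 tool.

```python
import re

# Matches lines such as: %ERLOG-i mem: rss= 466.096 MB vsize= 568.943 MB
MEM_LINE = re.compile(
    r"%ERLOG-i mem:\s*rss=\s*([\d.]+)\s*MB\s+vsize=\s*([\d.]+)\s*MB"
)


def parse_mem_lines(lines):
    """Yield (rss_mb, vsize_mb) for every memory trace line found."""
    for line in lines:
        match = MEM_LINE.search(line)
        if match:
            yield float(match.group(1)), float(match.group(2))


def peak_usage(lines):
    """Return (max_rss_mb, max_vsize_mb), or None if there are no traces."""
    samples = list(parse_mem_lines(lines))
    if not samples:
        return None
    return max(r for r, _ in samples), max(v for _, v in samples)


def histogram(values, bin_mb=100):
    """Count values per bin; each key is the lower edge of a bin in MB."""
    counts = {}
    for value in values:
        edge = int(value // bin_mb) * bin_mb
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # The only sample here is the line quoted in the thread.
    log = ["%ERLOG-i mem: rss= 466.096 MB vsize= 568.943 MB"]
    print(peak_usage(log))
    print(histogram(rss for rss, _ in parse_mem_lines(log)))
```

Taking the peak per log file and histogramming the peaks across files would give the kind of distribution discussed here, though only for D0reco itself and not for any auxiliary processes.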
> Parag Mhashilkar (parag@fnal.gov) wrote:
> Hi,
>
> The batch system in question, which killed the job was Sun Grid Engine
> (SGE). Do we have the feature of setting ulimit in d0runob as well?

Peter Love: No, I'll look into this...

Tibor: I just verified with our BQS batch system guys. In our case, the
maximal memory used for the process is the sum of all auxiliary
processes and of the process itself. And as I said, max memory usage for
all files in the dataset we are talking about was around 700 MB. It
doesn't look like there are other processes running in parallel
that could be the cause of surpassing the memory limit.
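
To make the accounting question concrete, here is a minimal sketch of summing resident memory over a job's process tree, which is roughly what is described above for BQS. It is not BQS, SGE, or Condor code: the function name is hypothetical, and the (pid, ppid, rss_mb) records in the example are invented for illustration rather than measured.

```python
from collections import defaultdict


def tree_rss_mb(processes, root_pid):
    """Sum rss_mb over root_pid and all of its descendants.

    processes is an iterable of (pid, ppid, rss_mb) tuples, for example
    collected from 'ps -eo pid,ppid,rss' and converted to MB.
    """
    children = defaultdict(list)
    rss = {}
    for pid, ppid, rss_mb in processes:
        children[ppid].append(pid)
        rss[pid] = rss_mb

    total = 0.0
    seen = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against malformed input with cycles
        seen.add(pid)
        total += rss.get(pid, 0.0)
        stack.extend(children.get(pid, []))
    return total


if __name__ == "__main__":
    # Hypothetical snapshot: a job wrapper, D0reco, and one monitoring
    # process under the wrapper, plus an unrelated process on the node.
    snapshot = [
        (100, 1, 5.0),      # job wrapper
        (101, 100, 700.0),  # D0reco
        (102, 100, 20.0),   # monitoring helper
        (200, 1, 300.0),    # unrelated, not part of the job
    ]
    print(tree_rss_mb(snapshot, 100))
```

A plain sum like this counts shared pages once per process, so it can overstate real usage; whether a given batch system does the same is something each site would have to confirm.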