Weekly Statistics

 
Week            Execution   Local  Events  Local jobs                 Jobs failed      Success rate*      Success rate*
                site        jobs           completed                  (causes)**       (events stored/    (jobs completed/
                                                                                       events submitted)  jobs crashed)

02/02 to 02/08  Lyon        220    55,000  124                        84 (1,2)         56.36%             55%
                Manchester  200    50,000  172                        28               86%                86%
                Wisconsin   200    50,000  90/154 (rest were killed)  64               58.4%              58.4%

02/09 to 02/14  Lyon        118    29,500  66                         52 (1a, 4b)      55.90%             55.90%
                Comments: Out of the 52 failed jobs, 50 (96%) were killed by the batch system (CPU time exceeded), 1 (2%) was unsuccessful in storing files (duplicate name) and 1 (2%) was killed by the user.
                Manchester  200    50,000  154                        46 (4a, 4b, 4c)  77%                77%
                Comments: 12 jobs got stuck in pythia (under investigation).
                Wisconsin   100    23,500  36                         25 (1c, 4b)      62.06%             62.06%
                Comments: 36 (36%) jobs crashed because jim_gridftp was stopped (human error); 5 (5%) because of duplicate file names; 20 (20%) core dumps; 6 (6%) still hanging around.

02/15 to 02/22  Lyon        100    25,000  81                         19               95.29%             95.29%
                Comments: 15 local jobs were killed due to manual error (sam.tgz was replaced with a newer, buggier one while local jobs were executing).
                Manchester  200    50,000  169                        31               84.5%              84.5%
                Comments: Jobs died because of core dumps and other configuration issues.
                Wisconsin   100    25,000  83                         8 (4c)           91.2%              91.2%
                Comments: 9 jobs were killed because test jobs were blocking the actual production request (Joel's request); 8 died because of duplicate file names.

02/22 to 03/09  Lyon        107    26,750  98                         9                93.45%             93.45%
                Comments: Jobs failed due to network communication failure and d0gstar crashing.
                Wisconsin   59     14,750  48                         11               81.35%             81.35%
                Comments: Network communication failure (10), core dumps (1).
                Manchester  ----   ----    ----                       ----             ----               ----
                Comments: Because of node contention and problems with the head node, sufficient statistics could not be gathered for this period.

03/15 to 03/18  Lyon        80     20,000  79                         1 (2)            98.75%             98.75%
                Comments: 1 job failed because the SAM project limit was exceeded.
                Wisconsin   139    34,750  127                        12 (4)           91.36%             91.36%
                Comments: d0gstar crashes (7), pythia crashes (5).
                Manchester  181    45,250  169                        12 (4)           93.37%             93.37%
                Comments: 6 d0sim crashes, 2 d0reco crashes, 4 d0gstar crashes.
 
* The Success Rate does not reflect the jobs that resulted in failures due to site misconfiguration and/or human errors.
** 1 = Site specific issue, 2 = JIM or SAM, 3 = Middleware issue, 4 = DZero, mc_runjob, ...
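
One way to read the Success Rate footnote is that jobs lost to site misconfiguration or human error are removed from the denominator. The sketch below assumes that reading; it is an inference from the table, not a formula stated on this page, and the function name `success_rate` is hypothetical. Under that assumption the Lyon row for 02/15 to 02/22 (100 local jobs, 81 completed, 15 killed by manual error) works out to 81 / (100 - 15).

```python
def success_rate(submitted, completed, excluded=0):
    """Percentage of completed jobs among the jobs that are counted.

    `excluded` is the number of jobs lost to site misconfiguration or
    human error, which the footnote says the Success Rate does not reflect.
    """
    counted = submitted - excluded
    if counted <= 0:
        raise ValueError("no jobs left after exclusions")
    return 100.0 * completed / counted


# Lyon, 02/15 to 02/22: 100 local jobs, 81 completed, 15 killed by manual error.
lyon = success_rate(submitted=100, completed=81, excluded=15)
print(f"{lyon:.2f}%")  # 95.29%
```

The same reading gives the Wisconsin figure for that week (83 completed, 9 test-related kills excluded), which suggests the assumption is at least consistent with those two rows.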

Summary Statistics (from 01-01-04 to 03-19-04)


Events produced (per execution site)
Wisconsin  = 199,558
Lyon       > 125,455
Manchester > 110,000
           ----------
Total      > 435,013

Number of Grid Jobs Submitted (per submission site):
samgrid.fnal.gov : 855    
ccd0.in2p3.fr    : 281
luhep02.lunet.edu:  63
               ----------
Total              1199

Number of Local Jobs Submitted (per execution site):
 site       #_local_jobs  CPU_time(h) Sys_time(h) CPU_norm_time(h) real_time(h)
Wisconsin      3546                                                 23423
Lyon           2695         8696       339           106840         28349
Manchester    >1500
            ------------
              >7741

Grid Efficiency*

(From 03-19-04 to 04-13-04)

Avg. Number of Events Requested per week  = 73,333
Avg. Number of Events Produced per week   = 65,451

Total Number of Events Requested          = 220,000
Total Number of Events Produced           = 196,353
Efficiency of the system over (25 days)   = 89.251%

(From 04-14-04 to 04-27-04)

Total Number of Events Requested          = 182,250
Total Number of Events Produced           = 172,550
Efficiency of the system over (14 days)   = 94.46%

* Efficiency is defined as Number of Events Produced / Number of Events Requested
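
The 89.251% figure for the first period corresponds to events produced divided by events requested. A minimal sketch of that arithmetic, with a hypothetical function name, using only the totals quoted above:

```python
def efficiency(events_produced, events_requested):
    """Return events produced as a percentage of events requested."""
    if events_requested <= 0:
        raise ValueError("events_requested must be positive")
    return 100.0 * events_produced / events_requested


# First period (03-19-04 to 04-13-04): 196,353 produced of 220,000 requested.
first_period = efficiency(196_353, 220_000)
print(f"{first_period:.3f}%")  # 89.251%
```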

SAM-Grid Deployment: Issues

This page lists the issues encountered during the deployment of the SAM-Grid.
The issues are categorized as related to
  1. Site specific
  2. JIM or SAM
  3. Middleware (Condor-G, Globus, ...)
  4. DZero code, mc_runjob, users

Site specific

Wisconsin

    a. Intra-cluster file transfers from the sam cache
    Open Jan 26
    Solution: we start up a dedicated gridftpd where the sam cache is. The clients will keep using the functions DhGet and DhGetRandom, which use sam_cp. sam_cp will be extended to use a well-configured gridftp client.
     Closed Feb 20
    b. Jobs sit idle in the queue for a long time
    Open Jan 19
    Solution: our jobs request dedicated machines, so that they are not evicted. Raising the user priorities could improve the situation.
    Closed 04/27/2004
    Using the glow cluster in addition to the "p" cluster, we now get around 100 machines on average.
     
    c. Large fraction of jobs core dump
    Open Feb 12th
    Please refer to the section "DZero code, mc_runjob, users".
    Closed 04/27/2004
    Not observed any more
    d. mc_runjob fails to exit
    Open Feb 13
    Trying to find the exact reason why mc_runjob fails to detect the completion of the job.
    Many jobs finish successfully (storing the reco file in SAM), but for some reason mc_runjob (the Monitor Thread of mc_runjob) cannot detect job completion. This results in a lot of CPUs being held up by mc_runjob and decreases overall throughput.

    e. Problem with dynamically started gridftp server
    Open Mar 6th
    The dynamically started gridftp server fails because of a bug in Globus.
    Solution: move away from the dynamically started gridftp server to a static one, while keeping the option of starting one if needed.
    Closed Mar 9th
     

IN2P3

    a. Some jobs exceed the max cpu limit of the longest queue
    Open Jan 19
    Investigating whether it is related to a class of request IDs.
    Closed Feb 09
    Investigation complete: the CPU time limit was exceeded for particular request IDs (e.g. 10084, 11249), for which d0gstar runs for some 18 hours. Moved to a new queue which has a larger CPU time limit.

    Open Mar 30, 2004
    The temporary area ("/tmp") was somehow cleaned up. As a result, the status files of the polling scripts were lost, which caused the jobs to be held at the grid level. Investigating how the files in the "/tmp" area were deleted and by whom. A possible solution is to use a separate area for creating temporary files.
    Closed 04/27/2004
    We now use a separate area for creating temporary files.

Manchester

    a. Clocks of the worker nodes are out of sync
    Open Jan 26
    The failures are due to the validity period of the proxies used by the gridftp client.
    Solution: if the clock is behind, the clients sleep until the proxy becomes valid. We are also working with Sabah to fix the ntpd configuration at the cluster.
    Closed 04/27/2004
    The clients sleep for the difference between the proxy start time and the current time.
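
The workaround for item a (sleep for the gap between the proxy start time and the local clock) can be sketched as follows. This is an illustration only: the function names are hypothetical, timestamps are assumed to be Unix seconds, and how the proxy start time is read from the proxy is not shown because this page does not say how the clients do it.

```python
import time


def seconds_until_valid(proxy_start, now):
    """Return how long to wait (in seconds) before the proxy becomes valid.

    Both arguments are Unix timestamps. A local clock that is behind makes
    `now` smaller than `proxy_start`; otherwise no wait is needed.
    """
    return max(0.0, proxy_start - now)


def wait_for_proxy(proxy_start, clock=time.time, sleep=time.sleep):
    """Sleep until the local clock reaches the proxy start time."""
    delay = seconds_until_valid(proxy_start, clock())
    if delay > 0:
        sleep(delay)
    return delay
```

The clock and sleep functions are passed in so the logic can be checked without actually waiting. Fixing ntpd on the cluster remains the real cure; the sleep only hides the skew.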

    b. SAM station disk often gets deactivated
    Open Jan 28

    Under investigation.
    Closed 04/27/2004
    Not observed any more.
     

    c. SAM station is often restarted, our projects are killed
    Open Jan 29
    Started intercepting mail from the Manchester station. Those messages went to the wrong addressees. MAN folks in denial.
    Closed 03/18/2004
    Station has been operating stably for over a month now

    d. The dynamically started gridftpd dies while the job is still running
    Clients hang forever if the gridftpd is not available. Under investigation.
    - This was caused by a bug in the old sam gridftp, which Gabriele has fixed in the new release.
    Closed Feb 20th
    e. No file can be stored in Enstore because of an eworker bug (old stager version is running).
    Open 02/02/2004
    Upgraded their sam_station to a newer version.
    Closed Feb 18
    f. Domain name cannot be determined from the worker nodes
    As a result of this, sam_cp resorts to a normal cp ( sam_cp needs the domain name to map it to a protocol).
    This results in the file transfer being slow.
    Solution: configure the nodes of the cluster properly.
    Closed 02/27/2004
    It works if we specify the node name prefix in sam_cp_config.py.
    g. qstat command for polling fails because of failure to connect to server
    Solution: incorporate retries.
    Closed Mar 02
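
The retry fix for the qstat polling failure (item g) can be sketched as a small wrapper. This is not the actual polling script: the function name, the number of attempts and the delay are assumptions chosen for illustration.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay=30.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run `command`, retrying when it exits with a non-zero status.

    Returns the command's standard output on success and raises
    RuntimeError once all attempts have failed.
    """
    last = None
    for attempt in range(1, attempts + 1):
        last = run(command, capture_output=True, text=True)
        if last.returncode == 0:
            return last.stdout
        if attempt < attempts:
            # Give the batch server time to become reachable again.
            sleep(delay)
    raise RuntimeError(
        f"{command!r} failed after {attempts} attempts: {last.stderr.strip()}"
    )
```

A polling script would call it as `run_with_retries(["qstat"])`. A fixed delay is the simplest choice; a growing delay would be gentler on a server that is already struggling.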

JIM or SAM

    a. Failure to resolve true user name due to DBS communication error

    Open Jan 26
    The clients fail when trying to get information from the sam db server. We are currently gathering statistics on the impact of this problem using different db servers.
    Depending on the site, this eats up 40-60% of jobs. The problem appears only on sites where user jobs run under usernames other than those at FNAL. Solution: eager approach, i.e. get the information needed as soon as possible (e.g. get the sam user when submitting the job). Introduce retry blocks.
    Closed Feb 17

    b. Extreme (15 hrs and more) slow-down of SAM projects due to pmaster project thrashing and excessive calls (120K in one instance) to the DB, which doesn't handle the traffic.

    Open Jan 29
    There are two projects that are run for each job.
    We are still gathering information about this.
    Closed Feb 23

    c. Stopping of the main SAM gridftp server results in stopping of ALL the grid_ftp servers, including those run by JIM.

    Fixed in the new release of sam_gridftp.
    Closed Feb 23

    d. Project limit is spuriously exceeded at a SAM station

    This is due to new projects starting before the station has a chance to clear old projects in the DB, which happens whenever communication with the DB is slow. Band-aid: increase the project limit.

    We now use a new way of getting files (sam dh get lite) that does not require running projects to get already cached files. Thanks to the SAM Team.
    Closed April 27th

    e. SAM db server is hogged by queries to resolve snap id to file list.
    Closed April 27th
    A new optimization in the SAM db server has reduced the time for this resolution to fractions of a second.

    f. SAM db server is hogged when a dataset definition consisting of 1000 files is created.
    Open April 27th
    A possible solution is to optimize the way this is done by getting rid of the OR operator. Investigation continues ...

Middleware

    a. Condor client does not work from a laptop (DHCP)

    Open Jan 21 - Closed Feb 05
    See bug #8849.
    Condor v6_6_0 fails to resolve the local hostname. Seems related to DHCP. Condor v6_5_0 was working. Under investigation.
    Resolution: Condor must be able to reverse-lookup the IP of the laptop via the DNS server, even when DHCP is used. Also, /etc/hosts must not have a wrong IP entry for the laptop name.
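
The condition in the resolution for item a (the laptop's name resolves to an IP, and that IP reverse-resolves to the same name) can be checked with a short script before blaming Condor. A minimal sketch using the standard `socket` module; the function name is hypothetical, and the check only approximates what Condor itself does.

```python
import socket


def reverse_lookup_matches(hostname,
                           forward=socket.gethostbyname,
                           reverse=socket.gethostbyaddr):
    """Return True if hostname -> IP -> name leads back to the same host.

    The lookups are injectable so the check can be exercised without DNS.
    """
    try:
        ip = forward(hostname)
        name, aliases, _addresses = reverse(ip)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    known = {name.lower(), *(alias.lower() for alias in aliases)}
    return hostname.lower() in known
```

Calling `reverse_lookup_matches(socket.getfqdn())` on the laptop gives a quick yes/no. A short hostname compared against a fully qualified reverse name will report False, so pass the fully qualified name.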

    b. Condor scheduler never releases the job after submission

    Open Jan 08
    See bugs #8914 and #8694.
    The Condor scheduler puts the job on hold while it transfers the input sandbox from the client. Sometimes it never releases it. Seems related to GSI.

    c. Condor client v6_6_0 cannot use network aliases for the condor scheduler

    Open Jan 21
    See bug #8849.
    Condor client v6_5_0 was working.

    d. Condor schedd fails to receive any jobs (at least from the user at hand) if a single GSI-related mishap occurred.

    Example: Condor incorrectly handles the X509_USER_CERT and X509_USER_KEY variables; setting those to point to user credentials will disable any communication between the user and Condor. Solution: kill schedd and resubmit all jobs (is this overkill?).

    e. X509_USER_PROXY env variable is reset by condor libraries for structured jobs (condor_dagman)

    Open April 13th
    The present condor_dagman binary mandates that the user proxy be present in the /tmp area with the name x509up_uXXX, where XXX is the unix uid of the schedd owner. This behavior is known to the Condor team and they are coming up with a solution to this problem.
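
Until that is fixed, a submission script can at least check that a proxy sits where condor_dagman expects it (item e). A minimal sketch; the function names are hypothetical, and the path pattern is the /tmp/x509up_uXXX form described above.

```python
import os


def dagman_proxy_path(uid):
    """Return the /tmp/x509up_uXXX path for the given Unix uid."""
    return f"/tmp/x509up_u{uid}"


def proxy_in_expected_place(uid, exists=os.path.isfile):
    """Report whether a proxy file is present where condor_dagman looks."""
    return exists(dagman_proxy_path(uid))
```

On the submission host, `proxy_in_expected_place(os.getuid())` would be run as the schedd owner. The check says nothing about whether the proxy is still valid, only that the file exists.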

DZero code, mc_runjob, users

    a. A particular phase of d0 code goes into an infinite loop
    Open Feb 17th
    At Manchester the pythia phase seems to go into an infinite loop. Under investigation.
     

    b. Core Dumps (faulty error handling in mc_runjob)

      Open Feb 12th
      It has been observed that on the Wisconsin site (RH9) there are a large number of core dumps by d0gstar, d0sim, etc.
      The code to handle core dumps is present in mc_runjob, but it is probably untested and incorrect. Thus jobs that dump core exit as follows:

      Exception in thread Thread-2:
      Traceback (most recent call last):
        File "/scratch/condor/execute/dir_19830/sam_client/python/lib/python2.1/threading.py", line 378, in __bootstrap
          self.run()
        File "runtime_utils/ExecutionThread.py", line 232, in run
          self._RunManagers(task_dir)
        File "runtime_utils/ExecutionThread.py", line 160, in _RunManagers
          mgr = runManagerObject(item,item,dirname,self)
        File "runtime_utils/BaseManager.py", line 16, in runManagerObject
          exec instantiator
        File "<string>", line 1, in ?
        File "runtime_utils/FileManager.py", line 22, in __init__
          self._CoreCheck()
        File "runtime_utils/FileManager.py", line 214, in _CoreCheck
          event.setType('core')
      UnboundLocalError: local variable 'event' referenced before assignment

      This has been observed more often at Wisconsin (could RH9 be the reason for the large number of core dumps?).
      Investigation revealed that d0sim dumps

      C++ runtime abort: terminate() called by the exception handling mechanism
      ./run_d0sim.sh: line 18: 24852 Aborted                 (core dumped) $EXENAME -rcp d0sim/rcp/runD0Sim.rcp

      and d0gstar dumps

      Incorrectly built binary which accesses errno or h_errno directly. Needs to be fixed.
      ./run_d0gstar.sh: line 25: 21221 Segmentation fault      (core dumped) $EXENAME -rcp simpp/rcp/sim_framework_mckine.rcp <d0gstar.inp

      The above error might be due to incompatibilities between the glibc components of RH9 and RH7.x.
      Please see (http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=89286) &
      (http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=90002)
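
The UnboundLocalError in the mc_runjob traceback above is the usual symptom of a local variable that is assigned on only some code paths and then used on another. The source of `_CoreCheck` is not shown on this page, so the sketch below is a hypothetical reconstruction of the pattern, not the real code: `Event`, `core_check_buggy` and `core_check_fixed` are invented names.

```python
class Event:
    """Stand-in for whatever event object mc_runjob uses (hypothetical)."""

    def __init__(self):
        self.type = None

    def setType(self, kind):
        self.type = kind


def core_check_buggy(files):
    """`event` is bound only if a log file happens to be seen first."""
    for name in files:
        if name.endswith(".log"):
            event = Event()
        if name.startswith("core"):
            event.setType("core")  # UnboundLocalError if no .log came first
            return event
    return None


def core_check_fixed(files):
    """Create the event on the same path that uses it."""
    for name in files:
        if name.startswith("core"):
            event = Event()
            event.setType("core")
            return event
    return None
```

With a directory listing such as `["core.24852"]` the first version raises the same error as in the traceback, while the second returns an event of type "core". The point is that a rarely exercised error path needs its own test.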

      c. Duplicate filename 

      Open Feb 11th
      mc_runjob generates files with a name that has already been declared by some other local job of the same grid bunch.
      This is due to the fact that mc_runjob uses time as the base for creating a unique filename (called workid in the metadata).
      Closed Feb 27th (now using the Uniqueness parameter in mc_runjob)
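
Why a time-based name collides, and how an extra per-job token avoids it, can be shown in a few lines. This is a generic illustration with hypothetical function names. It is not how the Uniqueness parameter of mc_runjob is implemented, which this page does not describe.

```python
import time
import uuid


def time_based_name(prefix, now=None):
    """Name derived only from the clock: two jobs in the same second collide."""
    stamp = int(time.time() if now is None else now)
    return f"{prefix}_{stamp}"


def unique_name(prefix, now=None, token=None):
    """Add a per-job token so that names differ even at the same timestamp."""
    token = uuid.uuid4().hex if token is None else token
    return f"{time_based_name(prefix, now)}_{token}"
```

Local jobs of one grid bunch tend to start almost simultaneously, which is exactly when clock-only names clash. Any token that differs per job (a job index, a process id combined with the hostname, or a random id as here) removes the clash.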

      d. Requests using Alpgen

      Such requests cannot be run right now. For such requests d0gstar takes its input from SAM, and these input files are generally huge (50,000 events).
      Solution: one option is to have the user split this huge file into smaller chunks (say 250 events in each file) and to fix mc_runjob to pull the file from SAM.
      Closed Feb 27th (if the generated phase has already been produced, we can pull it from SAM regardless of the number of events)

      e. d0mess

      The problem with using d0mess with 250-event jobs is that most of the jobs will not produce any data at all and will exit after running pythia.
      Solution: it's OK if some of them don't produce any data.
      Closed Feb 27th

      f. Improperly formed requests

      Many requests don't have the d0reco configurator specified in the request details. If run, these will not produce any data.
      Solution: add the d0reco configurator to the macro after preprocessing, but we prefer that the users create requests responsibly.
      Closed Feb 27th

      g. Explicit specification of the framework RCP file in the request breaks mc_runjob.

      This is due to SAM DB forcing conversion of MC parameters to lowercase (according to Iain).
      Solution: wipe out the framework_rcp line from the macro.
      Closed Feb 27th

    Page Started on Jan 28, 2004
    $Id: deployment-issues.html,v 1.31 2004/04/13 22:35:22 aditya Exp $