Weekly Statistics
| Week | Execution Site | Number of local jobs | Num of Events | Local jobs Completed | Jobs Failed (Causes)** | Success Rate* (Events Stored/Events Submitted) | Success Rate* (Jobs Completed/Jobs Crashed) | Comments |
| 02/02 to 02/08 | Lyon | 220 | 55,000 | 124 | 84 (1,2) | 56.36% | 55% | |
| | Manchester | 200 | 50,000 | 172 | 28 | 86% | 86% | |
| | Wisconsin | 200 | 50,000 | 90/154 (rest were killed) | 64 | 58.4% | 58.4% | |
| 02/09 to 02/14 | Lyon | 118 | 29,500 | 66 | 52 (1a, 4b) | 55.90% | 55.90% | Of the 52 failed jobs, 50 (96%) were killed by the batch system (CPU time exceeded), 1 (2%) failed to store files (duplicate name), and 1 (2%) was killed by the user. |
| | Manchester | 200 | 50,000 | 154 | 46 (4a, 4b, 4c) | 77% | 77% | 12 jobs got stuck in pythia (under investigation). |
| | Wisconsin | 100 | 23,500 | 36 | 25 (1c, 4b) | 62.06% | 62.06% | 36 (36%) jobs crashed because jim_gridftp was stopped (human error); 5 (5%) because of duplicate file names; 20 (20%) core dumps; 6 (6%) still hanging around. |
| 02/15 to 02/22 | Lyon | 100 | 25,000 | 81 | 19 | 95.29% | 95.29% | 15 local jobs were killed due to manual error (sam.tgz was replaced with a newer, buggier one while local jobs were executing). |
| | Manchester | 200 | 50,000 | 169 | 31 | 84.5% | 84.5% | Jobs died because of core dumps and other configuration issues. |
| | Wisconsin | 100 | 25,000 | 83 | 8 (4c) | 91.2% | 91.2% | 9 jobs were killed because test jobs were blocking the actual production request (Joel's request); 8 died because of duplicate file names. |
| 02/22 to 03/09 | Lyon | 107 | 26,750 | 98 | 09 | 93.45% | 93.45% | Jobs failed due to network communication failure and d0gstar crashing. |
| | Wisconsin | 59 | 14,750 | 48 | 11 | 81.35% | 81.35% | Network communication failure (10), core dumps (1). |
| | Manchester | ---- | ---- | ---- | ---- | ---- | ---- | Because of node contention and problems with the head node, sufficient statistics could not be gathered for this period. |
| 03/15 to 03/18 | Lyon | 80 | 20,000 | 79 | 01 (2) | 98.75% | 98.75% | 1 job failed because the SAM project limit was exceeded. |
| | Wisconsin | 139 | 34,750 | 127 | 12 (4) | 91.36% | 91.36% | d0gstar crashes (7); pythia crashes (5). |
| | Manchester | 181 | 45,250 | 169 | 12 (4) | 93.37% | 93.37% | 6 d0sim crashes; 2 d0reco crashes; 4 d0gstar crashes. |
* The Success Rate does not reflect
the jobs that resulted in failures due to site misconfiguration and/or
human errors.
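** 1 = Site specific issue, 2 = JIM or SAM, 3 = Middleware issue,
4 = DZero code, mc_runjob, users. A letter after the number (e.g. 4b)
refers to the lettered issue in the corresponding section below.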
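The success-rate footnote can be illustrated with a short calculation. This is a minimal sketch, assuming the rate is jobs completed divided by jobs submitted, after removing the jobs lost to human error or site misconfiguration; that formula is inferred from the table rows (for example Lyon, 02/15 to 02/22) and is not stated explicitly on this page. The function name `success_rate` is hypothetical.

```python
def success_rate(completed, submitted, excluded=0):
    """Percentage of jobs completed, ignoring `excluded` jobs that were
    lost to human error or site misconfiguration (assumed definition)."""
    counted = submitted - excluded
    if counted <= 0:
        raise ValueError("no jobs left after exclusions")
    return 100.0 * completed / counted


# Lyon, 02/15 to 02/22: 100 submitted, 81 completed, 15 killed by manual error.
print(round(success_rate(81, 100, excluded=15), 2))  # 95.29
# Manchester, 02/02 to 02/08: 200 submitted, 172 completed, nothing excluded.
print(round(success_rate(172, 200), 2))              # 86.0
```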
Summary Statistics (from 01-01-04 to 03-19-04)
Events produced (per execution site)
Wisconsin = 199,558
Lyon > 125,455
Manchester > 110,000
----------
Total > 435,013
Number of Grid Jobs Submitted (per submission site):
samgrid.fnal.gov : 855
ccd0.in2p3.fr : 281
luhep02.lunet.edu: 63
----------
Total 1199
Number of Local Jobs Submitted (per execution site):
site #_local_jobs CPU_time(h) Sys_time(h) CPU_norm_time(h) real_time(h)
Wisconsin 3546 23423
Lyon 2695 8696 339 106840 28349
Manchester >1500
------------
>7741
Grid Efficiency*
(From 03-19-04 to 04-13-04)
Avg. Number of Events Requested per week = 73,333
Avg. Number of Events Produced per week = 65,451
Total Number of Events Requested = 220,000
Total Number of Events Produced = 196,353
Efficiency of the system over 25 days = 89.251%
(From 04-14-04 to 04-27-04)
Total Number of Events Requested = 182,250
Total Number of Events Produced = 172,550
Efficiency of the system over 14 days = 94.46%
* Efficiency is defined as Number of Events Produced / Number of
Events Requested
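As a worked check of the first period, the sketch below takes efficiency as events produced divided by events requested, which is the ratio that reproduces the 89.251% figure quoted for 03-19-04 to 04-13-04. The function name `efficiency` is hypothetical.

```python
def efficiency(produced, requested):
    """Grid efficiency in percent: events produced over events requested."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return 100.0 * produced / requested


# Totals for 03-19-04 to 04-13-04, taken from the figures above.
print(f"{efficiency(196_353, 220_000):.3f}%")  # 89.251%
```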
SAM-Grid Deployment: Issues
This page lists the issues encountered during the deployment of the
SAM-Grid.
The issues are categorized as related to
- Site specific
- JIM or SAM
- Middleware (Condor-G, Globus, ...)
- DZero code, mc_runjob, users
Site specific
Wisconsin
a. Intra-cluster file transfers from the sam cache
Open Jan 26
Solution: we start up a dedicated gridftpd where the sam cache is. The
clients will keep using the functions DhGet and DhGetRandom, which use
sam_cp. sam_cp will be extended to use a well-configured gridftp
client.
Closed Feb 20
b. Jobs sit idle in the queue for a long time
Open Jan 19
Solution: our jobs request dedicated machines so that they are not
evicted. Raising the user priorities could improve the situation.
Closed 04/27/2004
Using the glow cluster in addition to the "p" cluster, we now get
around 100 machines on average.
c. Large fraction of jobs core dump
Open Feb 12th
Please refer to the section "DZero code, mc_runjob, users".
Closed 04/27/2004
Not observed any more
d. mc_runjob fails to exit
Open Feb 13
Trying to find the exact reason why mc_runjob fails to detect the
completion of the job.
Many jobs finish successfully (storing the reco file in SAM), but for
some reason mc_runjob (the Monitor Thread of mc_runjob) cannot detect
job completion.
This results in a lot of CPUs being held up by mc_runjob and decreases
overall throughput.
e. Problem with dynamically started gridftp server
Open Mar 6th
The dynamically started gridftp server fails because of a bug in Globus.
Solution: move away from the dynamically started gridftp server to
a static one, while keeping the option of starting one dynamically if
needed.
Closed Mar 9th
IN2P3
a. Some jobs exceed the max cpu limit of the longest queue
Open Jan 19
Investigating whether it is related to a class of request IDs.
Closed Feb 09
Investigation complete: the CPU time limit was exceeded for particular
request IDs (e.g. 10084, 11249), for which d0gstar runs for 18-odd
hours. Moved to a new queue which has a larger CPU time limit.
Open Mar 30, 2004
The temporary area ("/tmp") was cleaned up somehow. As a result, the
status files of the polling scripts were lost, and the jobs got held
at the grid level. Investigating how the files in the "/tmp" area were
deleted, and by whom.
A possible solution is to use a separate area for creating temporary
files.
Closed 04/27/2004
We now use a separate area for creating tmp files.
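The idea of keeping polling status files out of "/tmp" can be sketched as follows. This is only an illustration, assuming a dedicated base directory that no cleanup job touches; the helper name `make_status_dir` and the `polling_` prefix are hypothetical and are not taken from the actual polling scripts.

```python
import os
import tempfile


def make_status_dir(base_dir):
    """Create a private directory for polling status files under base_dir
    (a dedicated area) instead of the shared /tmp."""
    os.makedirs(base_dir, exist_ok=True)
    # mkdtemp creates a uniquely named directory readable only by its owner.
    return tempfile.mkdtemp(prefix="polling_", dir=base_dir)
```

A caller would pass a site-chosen path for `base_dir`; which path is used at IN2P3 is not stated on this page.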
Manchester
a. Clocks of the worker nodes are out of sync
Open Jan 26
The failures are due to the validity period of the proxies used by the
gridftp client: when a worker node's clock is behind, the proxy is not
yet valid there.
Solution: the clients sleep until the proxy becomes valid, if the clock
is behind. We are also working with Sabah to fix the ntpd configuration
at the cluster.
Closed 04/27/2004
We sleep for the difference between the current time and the proxy
start time.
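The workaround for item a amounts to a small amount of arithmetic before the transfer starts. The sketch below is an assumption about how such a wait could look, with times expressed as Unix timestamps; the function names and the injectable `clock` and `sleep` parameters are hypothetical and exist only to make the logic easy to test.

```python
import time


def seconds_until_valid(proxy_start, now):
    """Seconds to wait before the proxy becomes valid (0 if already valid)."""
    return max(0.0, proxy_start - now)


def wait_for_proxy(proxy_start, clock=time.time, sleep=time.sleep):
    """Sleep for the difference between the local clock and the proxy start
    time when the local clock is behind. Returns the delay that was applied."""
    delay = seconds_until_valid(proxy_start, clock())
    if delay > 0:
        sleep(delay)
    return delay
```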
b. SAM station disk often gets deactivated
Open Jan 28
Under investigation.
Closed 04/27/2004
Not observed any more.
c. SAM station is often restarted, our projects are killed
Open Jan 29
Started intercepting mail from the Manchester station. Those messages
went to the wrong addressees. MAN folks in denial.
Closed 03/18/2004
Station has been operating stably for over a month now
d. The dynamically started gridftpd dies while the job is still running
Clients hang forever if the gridftpd is not available. Under
investigation.
- This was because of a bug in the old sam gridftp, which Gabriele has
fixed in the new release.
Closed Feb 20th
e. No file can be stored in Enstore because of an eworker bug (old
stager version is running).
Open 02/02/2004
Upgraded their sam_station to a newer version.
Closed Feb 18
f. Domain name cannot be determined from the worker nodes
As a result, sam_cp resorts to a normal cp (sam_cp needs the domain
name to map it to a protocol).
This makes the file transfer slow.
Solution: configure the nodes of the cluster properly.
Closed 02/27/2004
If we specify the node name prefix in sam_cp_config.py, it works.
g. qstat command for polling fails because of failure to connect to
server
Solution: incorporate retries.
Closed Mar 02
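A retry wrapper of the kind adopted for item g could look like the sketch below. It is a generic illustration, not the code used by the polling scripts: the function name, the default number of attempts and the delay are all assumptions.

```python
import time


def call_with_retries(action, attempts=3, delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call action() up to `attempts` times, sleeping `delay` seconds after
    each failure listed in `retry_on`. The last failure is re-raised."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
```

For polling, `action` could wrap a call such as `subprocess.run(["qstat"], check=True)` with `retry_on=(OSError, subprocess.CalledProcessError)`; the exact qstat invocation used at Manchester is not given on this page.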
JIM or SAM
a. Failure to resolve true user name due to DBS communication error
Open Jan 26
The clients fail when trying to get information from the sam db server.
We are currently gathering statistics on the impact of this problem
using different db servers.
Depending on the site, this eats up 40-60% of jobs. The problem only
shows up on sites where user jobs run under usernames other than those
at FNAL.
Solution: eager approach: get the information needed as soon as
possible (e.g. get the sam user when submitting the job). Introduce
retrial blocks.
Closed Feb 17
b. Extreme (15 hrs and more) slow-down of SAM projects due to pmaster
project thrashing and excessive calls to the DB (120K in one instance),
which doesn't handle the traffic.
Open Jan 29
There are two projects that are run for each job:
- The DhGet script retrieves all the dataset files "atomically" and is
used to retrieve D0 code and files that are packaged separately from
it, such as Magnetic field and cardfiles. Since all the jobs ask for
the same fileset, it makes sense to explore the Freight Train mode to
reduce the total number of projects run (and thus reduce the load on
the headnode).
- The DhGetRandom script retrieves a random subset of a given size from
a dataset and is used to retrieve minbias files (JIM uses SAM wherever
possible). The basic implementation of the script requests the entire
dataset in one project and then keeps a small subset on local disk.
This generates too many calls to the DBS. This was fixed in the sense
that a dynamically created sub-dataset is retrieved.
We keep gathering information about this.
Closed Feb 23
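The DhGetRandom fix in item b, requesting only a dynamically created sub-dataset rather than the whole dataset, can be sketched as a selection step done before any project is started. This is an illustration under stated assumptions: the function name and the `seed` parameter are hypothetical, and how the real script builds the sub-dataset in SAM is not described on this page.

```python
import random


def pick_sub_dataset(file_names, size, seed=None):
    """Choose `size` distinct file names at random from the full dataset,
    so that only this subset has to be requested."""
    files = list(file_names)
    if size > len(files):
        raise ValueError("subset larger than dataset")
    return random.Random(seed).sample(files, size)
```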
c. Stopping the main SAM gridftp server stops ALL the grid_ftp servers,
including those run by JIM.
Fixed in the new release of sam_gridftp.
Closed Feb 23
d. Project limit is spuriously exceeded at a SAM station
This is due to new projects starting before the station has a chance to
clear old projects in the DB, which happens whenever communication with
the DB is slow. Band-aid: increase the project limit.
We now use a new way of getting files (sam dh get lite) that doesn't
require running projects to get already cached files. Thanks to the SAM
Team.
Closed April 27th
e. SAM db server is hogged by queries to resolve snap id to file list.
Closed April 27th
A new optimization in the SAM db server has reduced the time for this
resolution to fractions of a second.
f. SAM db server is hogged when a dataset definition consisting of 1000
files is created.
Open April 27th
A possible solution is to get rid of the OR operator. Investigation
continues ...
Middleware
a. Condor client does not work from a laptop (DHCP)
Open Jan 21 - Closed Feb 05
See bug #8849.
Condor v6_6_0 fails to resolve the local hostname. Seems related to
dhcp. Condor v6_5_0 was working. Under investigation.
Resolution: condor must be able to reverse-lookup the IP of the laptop
via the DNS server, even when dhcp is used. Also, /etc/hosts must not
have
a wrong IP entry for the laptop name.
b. Condor scheduler never releases the job after submission
Open Jan 08
See bugs #8914 and #8694.
The condor scheduler puts the job in hold while it transfers the input
sandbox from the client. Sometimes it never releases it. Seems related
to GSI.
c. Condor client v6_6_0 cannot use network aliases for the condor
scheduler
Open Jan 21
See bug #8849.
Condor client v6_5_0 was working.
d. Condor schedd fails to receive any jobs (at least from the user at
hand) if a single GSI-related mishap has occurred.
Example: Condor incorrectly handles the X509_USER_CERT and
X509_USER_KEY variables; setting those to point to user credentials
will disable any communication between the user and Condor. Solution:
kill schedd and resubmit all jobs. (Is this overkill?)
e. X509_USER_PROXY env variable is reset by condor libraries for
structured jobs (condor_dagman)
Open April 13th
The present condor_dagman binary mandates that the user proxy be
present in the /tmp area with the name x509up_uXXX, where XXX is the
unix uid of the schedd owner. This behavior is known to the Condor team
and they are coming up with a solution to this problem.
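The naming rule described in item e can be written down as a one-line helper, which is useful for checking where condor_dagman will look for the proxy before submitting. The helper names are hypothetical; only the "/tmp/x509up_uXXX" pattern comes from the description above.

```python
def expected_proxy_path(uid):
    """Path where condor_dagman expects the proxy, for the given unix uid
    of the schedd owner."""
    return f"/tmp/x509up_u{uid}"


def proxy_needs_copy(current_proxy_path, uid):
    """True if the proxy currently in use is not at the expected location."""
    return current_proxy_path != expected_proxy_path(uid)
```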
DZero code, mc_runjob, users
a. A particular phase of d0 code goes into an infinite loop
Open Feb 17th
At Manchester the pythia phase seems to go into an infinite loop; under
investigation.
b. Core Dumps (faulty error handling in mc_runjob)
Open Feb 12th
It has been observed that on the Wisconsin site (RH9) there is a large
number of core dumps by d0gstar, d0sim etc.
The code to handle core dumps is present in mc_runjob, but it is
probably untested and incorrect. Thus jobs which dump core exit as
follows:
    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/scratch/condor/execute/dir_19830/sam_client/python/lib/python2.1/threading.py", line 378, in __bootstrap
        self.run()
      File "runtime_utils/ExecutionThread.py", line 232, in run
        self._RunManagers(task_dir)
      File "runtime_utils/ExecutionThread.py", line 160, in _RunManagers
        mgr = runManagerObject(item,item,dirname,self)
      File "runtime_utils/BaseManager.py", line 16, in runManagerObject
        exec instantiator
      File "<string>", line 1, in ?
      File "runtime_utils/FileManager.py", line 22, in __init__
        self._CoreCheck()
      File "runtime_utils/FileManager.py", line 214, in _CoreCheck
        event.setType('core')
    UnboundLocalError: local variable 'event' referenced before assignment
This has been observed more often at Wisconsin (could RH9 be the reason
for the large number of core dumps?).
Investigation revealed that d0sim dumps
    C++ runtime abort: terminate() called by the exception handling mechanism
    ./run_d0sim.sh: line 18: 24852 Aborted (core dumped) $EXENAME -rcp d0sim/rcp/runD0Sim.rcp
and d0gstar dumps
    Incorrectly built binary which accesses errno or h_errno directly. Needs to be fixed.
    ./run_d0gstar.sh: line 25: 21221 Segmentation fault (core dumped) $EXENAME -rcp simpp/rcp/sim_framework_mckine.rcp <d0gstar.inp
The above error might be due to incompatibilities between the RH9 and
RH7.x glibc components.
Please see
(http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=89286)
&
(http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=90002)
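The UnboundLocalError in the traceback for item b is the usual symptom of a local variable that is assigned on only one code path. The source of FileManager._CoreCheck is not shown on this page, so the sketch below is a hypothetical reconstruction of the pattern, written in modern Python rather than the Python 2.1 used by mc_runjob; the `Event` class and both functions are stand-ins.

```python
class Event:
    """Hypothetical stand-in for the event object used by mc_runjob."""

    def __init__(self):
        self.type = None

    def setType(self, kind):
        self.type = kind


def core_check_buggy(core_found, reporting_enabled):
    # 'event' is bound only when reporting_enabled is true ...
    if reporting_enabled:
        event = Event()
    if core_found:
        # ... so this line raises UnboundLocalError otherwise.
        event.setType("core")
        return event
    return None


def core_check_fixed(core_found):
    # Bind 'event' on the same path that uses it.
    if not core_found:
        return None
    event = Event()
    event.setType("core")
    return event
```

Because the failing path runs only when a core file is actually found, a bug of this shape can stay hidden until jobs start dumping core, which matches the remark that the handler was probably untested.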
c. Duplicate filename
Open Feb 11th
mc_runjob generates files with a name that has already been declared
by some other local job of the same grid bunch.
This is due to the fact that mc_runjob uses time as the base for
creating a unique filename (called workid in the metadata).
Closed Feb 27th (now using the Uniqueness parameter in mc_runjob)
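Why a time-based name collides, and how an extra per-job token avoids it, can be shown in a few lines. This is a sketch only: the naming scheme, the one-second resolution and the way the token is appended are assumptions, and the actual format of workid and of the Uniqueness parameter in mc_runjob is not given on this page.

```python
def time_based_name(prefix, now):
    """Name derived from the timestamp alone (assumed 1-second resolution)."""
    return f"{prefix}_{int(now)}"


def unique_name(prefix, now, uniqueness):
    """Same name with a per-job token appended to break ties."""
    return f"{prefix}_{int(now)}_{uniqueness}"


# Two local jobs of one grid job starting within the same second:
print(time_based_name("reco", 1000.2) == time_based_name("reco", 1000.9))  # True
print(unique_name("reco", 1000.2, "job1") == unique_name("reco", 1000.9, "job2"))  # False
```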
d. Requests using Alpgen
Such requests cannot be run right now. For such requests d0gstar takes
its input from SAM, and these input files are generally huge (50,000
events).
Solution: one solution is to have the user split this huge file into
smaller chunks (say 250 events in each file) and fix mc_runjob to pull
the file from SAM.
Closed Feb 27th (if the generated phase is already produced, no matter
the number of events, we can pull it from SAM)
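The splitting proposed for item d is simple bookkeeping: a 50,000-event file cut into 250-event pieces gives 200 chunks. The sketch below computes the chunk boundaries only; the function name is hypothetical and no Alpgen or SAM file format is assumed.

```python
def chunk_ranges(total_events, chunk_size):
    """Return (first_event, n_events) pairs covering total_events in
    chunks of at most chunk_size events."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(chunk_size, total_events - start))
            for start in range(0, total_events, chunk_size)]


print(len(chunk_ranges(50_000, 250)))  # 200
```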
e. d0mess
The problem with using d0mess with 250-event jobs is that most of the
jobs will not produce any data at all and will exit after running
pythia.
Solution: it's OK if some of them don't produce any data.
Closed Feb 27th
f. Improperly formed requests
Many requests don't have the d0reco configurator specified in the
request details. If run, these will not produce any data.
Solution: add the d0reco configurator to the macro after
preprocessing, but we prefer that the users create requests
responsibly.
Closed Feb 27th
g. Explicit specification of the framework RCP file in the request
breaks mc_runjob.
This is due to the SAM DB forcing conversion of MC parameters to
lowercase (according to Iain).
Solution: wipe out the framework_rcp line from the macro.
Closed Feb 27th
Page Started on Jan 28, 2004
$Id: deployment-issues.html,v 1.31 2004/04/13 22:35:22 aditya Exp $