From Thomas.Nunnemann@Physik.Uni-Muenchen.DE Mon Feb 16 11:55:27 2004
Date: Wed, 11 Feb 2004 10:25:14 +0100
From: Thomas Nunnemann
To: diesburg
Cc: d0-data-reprocessing@fnal.gov
Subject: Re: Reprocessing action items for discussion

Hi Mike -

I am reporting on your point 3 (database access) in today's GCAS meeting.
Slides are available from
http://www-d0.hef.kun.nl///fullAgenda.php?ida=a04228.

Thomas

diesburg wrote:

> Below is a list of areas where I think we need
> effort for the next reprocessing run. I don't claim this
> to be complete. It is a starting point for discussion.
> I have probably overlooked some obvious items.
> I think each of the areas listed below is a place
> where we need to identify a responsible individual or group.
>
> Mike
>
>
> ------------------------------------------------------------------------
>
>
> 1) Project assignment
>
> Need web page where sites can sign up to do specified
> runs. Page should produce project definitions that
> identify the remote site and are of appropriate size.
> It must also ensure no duplicate assignments are made.
>
> 2) Data delivery
>
> Need to spec out server to handle both data delivery
> of raw files and storage of DSTs and thumbnails.
> Probably want to set up a dedicated node with Gb
> connections and plenty of buffer space.
>
> Should be a single solution suitable for all remote
> sites. Need to understand what changes/improvements
> need to be made in sam so everyone can use routing
> station for delivery.
>
> Need to understand how we will handle tape drive
> access. We will have to move 5-10 times as much
> data to remote sites for next round of processing.
> We will not be able to restrict other tape-intensive
> activities for extended periods. Need to explore
> ways to prioritize access or possibly dedicate drives
> to remote delivery.
>
> 3) Database access
>
> Need extensive testing of proxy DB servers. How
> does this scale? Is it robust enough?
> What resources are needed on FNAL end to support remote access? Can
> data be cached locally ahead of processing to reduce
> load? What do we need to do to make this usable at
> all remote sites?
>
> 4) Certification/monitoring
>
> One of the experiment's goals for this year is to have
> certification run as a part of normal processing. We
> should expect to do this for the reprocessing as well.
> Should have recocert incorporated into mc_runjob.
> Someone will need to set this up and take care of storing
> results to a central location.
>
> Need to have a well-defined line of responsibility for
> initial certification and ongoing monitoring of results.
> We should insist that representatives of physics groups
> be designated to do this.
>
> Should also have a central repository for a standard set
> of plots from the run-time info produced by mc_runjob.
> What additional pieces of information should runjob
> produce that would be helpful to real-time monitoring?
> Do we have a standard mechanism for displaying the
> runjob info?
>
>
> 5) Merging
>
> Need a portable, robust merge script that is usable at
> all, or at least most, remote sites. May be difficult
> because it requires significant amounts of sam DB access
> and also requires central access to all output files from
> a run.
>
> Need to explore what is the best utility to use for merge.
> CopyD0om is very slow, but does ensure integrity of input
> files. Evcopy is fast, but will just pass along corrupt
> input. Output still needs to be checked. This step can
> easily turn into a processing bottleneck. Might have to
> parallelize the merge.
>
> Robustness of this step to restarts is critical to avoid
> getting duplicate events in the data record.
>
> 6) Storing
>
> Since the merges will be done locally next time, the stores
> This will require a robust storage script which can identify
> known failure modes and do appropriate retries.
> There was much more corruption of thumbnail files in transport
> to d0mino than I expected to see in the reprocessing run.
> We will need to take advantage of all integrity checking
> mechanisms that sam has available and possibly add some of
> our own. This will likely require quick turnaround for
> modifications of this script as real problems will probably
> not manifest themselves until we are in actual production.
>
> 7) mc_runjob enhancements
>
> I am not sure what might need to be done beyond what is
> noted above with respect to monitoring. Ideally runjob
> would use a robust version of d0rte and all remote sites
> would need to do is get the latest versions in the
> production release to run. But that's probably just a
> fantasy. We need to have someone in charge of this.
> Iain is the most likely victim, but he might be getting
> a bit tired of doing it.
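
A minimal sketch of the retry-and-verify behaviour asked for under item 6 (Storing), assuming the transfer step can be wrapped in a Python callable. Everything named here is hypothetical and only for illustration: `store_with_retries`, `file_checksum`, and `TransientStoreError` are not part of sam or mc_runjob, `shutil.copyfile` stands in for the real transfer, and SHA-256 stands in for whatever integrity checks sam actually provides.

```python
import hashlib
import shutil
import time


class TransientStoreError(Exception):
    """Hypothetical marker for a known, retryable failure mode."""


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_with_retries(source, destination, transfer=shutil.copyfile,
                       max_attempts=3, delay_seconds=0.0):
    """Copy source to destination and verify the checksum after each attempt.

    A checksum mismatch or a TransientStoreError triggers a retry; any other
    exception propagates immediately. Returns the number of attempts used.
    """
    expected = file_checksum(source)
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(source, destination)
            if file_checksum(destination) == expected:
                return attempt
        except TransientStoreError:
            pass  # known failure mode: fall through and retry
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(
        f"store of {source} failed after {max_attempts} attempts"
    )
```

Treating only an explicitly listed exception type as retryable is one way to read "identify known failure modes": unknown errors surface at once instead of being retried blindly, and a transfer that completes but delivers a corrupt file is caught by the checksum comparison rather than passed along.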