From diesburg@fnal.gov Wed Feb 11 17:23:59 2004
Date: Wed, 11 Feb 2004 02:04:25 -0600
From: diesburg
To: d0-data-reprocessing@fnal.gov
Subject: Reprocessing action items for discussion

Below is a list of areas where I think we need effort for the next reprocessing run. I don't claim this to be complete. It is a starting point for discussion. I have probably overlooked some obvious items. I think each of the areas listed below is a place where we need to identify a responsible individual or group.

Mike

[ Part 2: "Attached Text" ]

1) Project assignment

Need a web page where sites can sign up to do specified runs. The page should produce project definitions that identify the remote site and are of appropriate size. It must also ensure no duplicate assignments are made.

2) Data delivery

Need to spec out a server to handle both delivery of raw files and storage of DSTs and thumbnails. Probably want to set up a dedicated node with Gb connections and plenty of buffer space. Should be a single solution suitable for all remote sites.

Need to understand what changes/improvements need to be made in sam so everyone can use a routing station for delivery.

Need to understand how we will handle tape drive access. We will have to move 5-10 times as much data to remote sites for the next round of processing. We will not be able to restrict other tape-intensive activities for extended periods. Need to explore ways to prioritize access or possibly dedicate drives to remote delivery.

3) Database access

Need extensive testing of proxy DB servers. How does this scale? Is it robust enough? What resources are needed on the FNAL end to support remote access? Can data be cached locally ahead of processing to reduce load? What do we need to do to make this usable at all remote sites?

4) Certification/monitoring

One of the experiment's goals for this year is to have certification run as a part of normal processing. We should expect to do this for the reprocessing as well.
Should have recocert incorporated into mc_runjob. Someone will need to set this up and take care of storing results to a central location.

Need to have a well-defined line of responsibility for initial certification and ongoing monitoring of results. We should insist that representatives of physics groups be designated to do this.

Should also have a central repository for a standard set of plots from the run-time info produced by mc_runjob. What additional pieces of information should runjob produce that would be helpful to real-time monitoring? Do we have a standard mechanism for displaying the runjob info?

5) Merging

Need a portable, robust merge script that is usable at all, or at least most, remote sites. May be difficult because it requires significant amounts of sam DB access and also requires central access to all output files from a run.

Need to explore what is the best utility to use for the merge. CopyD0om is very slow, but does ensure integrity of input files. Evcopy is fast, but will just pass along corrupt input. Output still needs to be checked.

This step can easily turn into a processing bottleneck. Might have to parallelize the merge. Robustness of this step to restarts is critical to avoid getting duplicate events in the data record.

6) Storing

Since the merges will be done locally next time, the stores [...] This will require a robust storage script which can identify known failure modes and do appropriate retries.

There was much more corruption of thumbnail files in transport to d0mino than I expected to see in the reprocessing run. We will need to take advantage of all integrity checking mechanisms that sam has available and possibly add some of our own. This will likely require quick turnaround for modifications of this script, as real problems will probably not manifest themselves until we are in actual production.

7) mc_runjob enhancements

I am not sure what might need to be done beyond what is noted above with respect to monitoring.
Ideally runjob would use a robust version of d0rte and all remote sites would need to do is get the latest versions in the production release to run. But that's probably just a fantasy. We need to have someone in charge of this. Iain is the most likely victim, but he might be getting a bit tired of doing it.
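
As a starting point for discussing item 1, here is a minimal sketch of the bookkeeping behind a sign-up page: it rejects any request that touches an already-assigned run and splits accepted requests into project definitions of bounded size. Everything here is an assumption for illustration: the class name `AssignmentBook`, the in-memory dictionary (a real page would need persistent storage), the site names, and measuring "appropriate size" as a number of runs per project.

```python
class AssignmentBook:
    """Hypothetical in-memory record of which site has claimed which run."""

    def __init__(self, max_runs_per_project=50):
        # Assumption: project size is limited by a simple count of runs.
        self.max_runs_per_project = max_runs_per_project
        self.owner_by_run = {}

    def claim(self, site, runs):
        """Assign runs to a site and return the resulting project definitions.

        The whole request is rejected if any requested run is already
        assigned, so no run can ever end up with two owners.
        """
        runs = sorted(set(runs))
        taken = [run for run in runs if run in self.owner_by_run]
        if taken:
            raise ValueError(f"runs already assigned: {taken}")
        for run in runs:
            self.owner_by_run[run] = site
        # Split the accepted runs into projects of bounded size.
        size = self.max_runs_per_project
        return [
            {"site": site, "runs": runs[i:i + size]}
            for i in range(0, len(runs), size)
        ]


if __name__ == "__main__":
    book = AssignmentBook(max_runs_per_project=2)
    for project in book.claim("site-a", [1003, 1001, 1002]):
        print(project)
    try:
        book.claim("site-b", [1003, 1004])
    except ValueError as error:
        print("rejected:", error)
```

Rejecting the whole request, rather than assigning only the free runs, keeps the behavior easy to reason about: a site either gets exactly what it asked for or nothing changes.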
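
For item 6, the retry-and-verify logic of a storage script could be sketched roughly as follows. This is not sam's interface: the `transfer` argument is a hypothetical stand-in for whatever store command is actually used (it defaults to a plain local copy so the sketch runs anywhere), SHA-256 is an assumed choice of checksum, and the only "known failure modes" handled are I/O errors and checksum mismatches.

```python
import hashlib
import shutil
import time
from pathlib import Path


def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_with_retries(source, destination, transfer=shutil.copyfile,
                       max_attempts=3, delay_seconds=0.0):
    """Store a file and verify it, retrying on known failures.

    `transfer` is a hypothetical placeholder for the real store command.
    Returns the number of attempts used; raises RuntimeError if every
    attempt fails.
    """
    expected = checksum(source)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(source, destination)
            if checksum(destination) == expected:
                return attempt
            last_error = "checksum mismatch"
        except OSError as error:
            # Assumption: I/O errors are transient and worth retrying.
            last_error = str(error)
        # Remove a partial or corrupt copy before the next attempt.
        Path(destination).unlink(missing_ok=True)
        time.sleep(delay_seconds)
    raise RuntimeError(
        f"store failed after {max_attempts} attempts: {last_error}"
    )
```

Comparing the checksum of the stored copy against the source on every attempt means corruption in transport is caught at store time rather than discovered later, and deleting a bad copy before retrying avoids leaving a corrupt file at the destination.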