1) Processing assignments: We had problems in two areas with respect to processing assignments.
   1) Making the sam projects for remote sites centrally caused some delays in getting files delivered. Although we will still need a central method to assign processing tasks, there should be no need for this to be done manually as long as we are processing from raw data. Processing from DSTs required that the correct version of the DST be selected on a run-by-run basis.
   2) Shifting processing priorities also caused delays in delivery, and in some cases files already delivered were not processed. This is an inevitable result of not having stable, predictable code. Major changes in d0reco efficiency will always force a re-evaluation of the processing plan. We desperately need a stable code base as early as possible. We also should not worry about maximizing any particular subset of the data in time for conferences. I think the idea of not being driven by conference schedules has at least been supported, if not fully endorsed, by the spokesmen. To me the priority setting seems simple: the most recent data is always the most interesting, so we should start with the latest data and work linearly backward to the earliest.

2) Data transport: There were several problem areas associated with the delivery of files to the remote sites and the transfer of thumbnails back to FNAL.
   1) Competition with other tape-intensive activities caused severe delivery delays. This was solved by temporarily halting the conflicting activities for a short period of time. However, that is not likely to be an acceptable solution for the next round, when the reprocessing will span many months. We will need to have either dedicated drives or a well-understood way to control access priorities.
   2) Competition for resources in the central sam router also led to bottlenecks. This was alleviated by setting up a separate router station for the reprocessing effort.
      Unfortunately, using the central router was not convenient for all sites. We also set up a separate station to do local staging to accommodate file transfer. While this worked, we should try to avoid implementing multiple solutions for the same problem. We need to understand what is necessary so that all sites use the same central resources.
   3) All file transfers, both to and from FNAL, went through the default interface on d0mino. Although there was no clear evidence that this was a problem, it is likely that a bottleneck could form here. We should have the off-site traffic routed through a dedicated interface (and possibly a dedicated machine).
   4) A significant number of thumbnail files were corrupted in transport to FNAL. We need to have checking mechanisms in place to protect against such transport failures. Initiating all stores from the remote sites would allow the sam checking mechanisms to be used. This would help, but it would also be prudent to implement some additional checks as per Marco's suggestions. This opens the question of whether we need to do any integrity checking on the input data after it arrives at the remote sites. I don't recall any evidence that we had any corruption of data transported from FNAL. This might indicate that sam's checking mechanisms are adequate.
   5) Network bandwidth to some sites was not really adequate to service a large installation of worker nodes. I don't think there is any real solution here except prestaging the input files well in advance of processing. Even with that tactic we might run into trouble. If the processing rate exceeds the network rate, the farms will eventually catch up with the data stream. The only real way to guarantee this doesn't happen is to prestage everything a remote site will do before we start. But that requires substantial storage capacity at the remote sites for the input data.

Comment: All the above problems will be much worse in the next round of reprocessing.
We will have to transfer 5-10 times as much data to the remote sites. We will also need to make a decision as soon as possible about what to do with the DSTs (store locally or at FNAL?).

3) Code/script issues. There were problems with both the d0reco code itself and the scripts used to run the code.
   1) The poor state of the d0reco executable in the early planning stages led to missteps in setting priorities and to significant unnecessary effort spent setting up to do selective processing.
   2) The d0reco code itself was not properly tested before releases were made. Remote sites had to deal with releases that had incorrect RCP files, code that required DB connections that were not available, and high failure rates. The code should be tested in a sterile environment and be known to run at the ~1000 event level before a production release is made.
   3) Implicit dependencies required significant effort with each new release to identify the specific libraries and/or products that needed to be included in the tarball distributed to remote sites. This was time consuming, since it is usually a trial-and-error process. We need a fully functional and well-tested rte as part of the release.

4) Certification process. Running recocert was not a problem, but the line of responsibility for blessing the results was muddy.
   1) We weren't really able to get the attention of the physics groups to pass judgment on the recocert output. This responsibility was eventually passed back to us. There needs to be an appointed group of people (not the people doing the reprocessing) who are charged with examining the recocert output from certification runs and declaring the results ready for production.
   2) Actually running recocert was not a large burden (maybe Phil thinks otherwise?), but it should be automated. It should be incorporated into mc_runjob so it can be turned on or off by flipping a switch.

5) Bookkeeping. This was basically every man for himself.
   1) We didn't have a common way of doing job accounting at the remote sites. This resulted in a lot of duplicated effort, with each site coming up with its own methods of tracking job completion and resubmission.
   2) Not storing the output files immediately into sam prevented some sites from using sam to do the job accounting. Problems with sam reliability led others to make their own local databases to do the same job. Doing the thumbnail merges at the remote sites and initiating all sam stores remotely should allow the sam DB to be used to track all job completions.

6) Miscellaneous.
   1) Including the site name in the file names was a mistake. This initially seemed like a good idea, but it has the drawback that duplicate events can get into the data set if the same files somehow get processed at more than one site. And, yes, it did happen. If we really need this info we should investigate some other way to record it. It is encoded in the process ID, but this is essentially inaccessible and hence not of much use.
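
To make the duplicate problem in the last item concrete, here is a minimal sketch of how such duplicates could be flagged. It is only an illustration: the file name layout (`<kind>_<site>_<run>_<part>.<ext>`), the site names in the usage note, and the function names are all assumptions made up for this example, not the naming scheme actually used. The idea is simply that if the site is left out of the key used for comparison, the same input processed at two sites collapses onto one key.

```python
import re
from collections import defaultdict

# Hypothetical file name layout (an assumption for illustration only):
#   <kind>_<site>_<run>_<part>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<kind>[a-z]+)_(?P<site>[A-Za-z0-9]+)_(?P<run>\d+)_(?P<part>\d+)\.(?P<ext>\w+)$"
)


def site_independent_key(file_name):
    """Return a key identifying the processed data, ignoring the site name."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized file name: {file_name!r}")
    return (match["kind"], int(match["run"]), int(match["part"]))


def find_duplicates(file_names):
    """Group file names by site-independent key; keep keys seen more than once."""
    groups = defaultdict(list)
    for name in file_names:
        groups[site_independent_key(name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Calling `find_duplicates` on a list that contains, say, `tmb_siteA_170001_003.root` and `tmb_siteB_170001_003.root` would report both names under the single key for run 170001, part 3. Dropping the site from the file name altogether would make such a collision show up as a plain name clash at store time instead.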