From mverzocc@fnal.gov Wed Feb 18 10:04:30 2004
Date: Wed, 18 Feb 2004 02:04:15 -0600 (CST)
From: Marco Verzocchi
To: Tibor Kurca
Subject: Re: Reprocessing action items for discussion

Hi Tibor

1) script for file merging:

   /home/mverzocc/scripts/mergeFiles.py
   /home/mverzocc/scripts/mergeUtilities.py

   to use it (currently works only on d0mino, for a silly reason:
   I have to make sure that the most recent version of evcopy does
   what I want and that it prints one line per event..... I am
   using a private version of evcopy right now)

   setup D0RunII p14.06.00
   setup sam

   ensure that /home/mverzocc/scripts is in the PYTHONPATH
   environment variable.

   These are really tailored for the common sample group skimming;
   some modifications are needed (choice of file name, criteria for
   selecting the files to be merged) to adapt them for the
   reprocessing (where probably the merging has to be restricted to
   a run basis and for the file name one wants to follow more or
   less the current schemes)

   mergeUtilities.py contains several classes:

   * fileMerger (does the low level file merging)
     methods:
        setMergedFileName - sets the name of the merged file
        setUserDefinedMetadata - allows the user to add generic
          metadata
        addFileToMerge - add one file to the list of files to be
          merged
        addOnlyMetadata - add an empty file to the list of files to
          be merged (a file with 0 events, you still have to merge
          the metadata)
        getListMergedFiles - return the list of files to be merged
        onlyOneFile - merge 1 input file into 1 output file, simply
          rename it (but in some cases one has to do a mv command,
          os.rename doesn't work properly on every system)
        runEvCopy - does the file merging using EvCopy (need to
          change here to use the official version of EvCopy)
        runCopyD0om - does the file merging using CopyD0om
        fileChecker - does the file checking after the file has been
          merged.
          in sequence: open the D0om file catalogue and count the
          events, run DsDump, calculate a checksum using the ecrc
          program (from setup encp)
        writeMetadata - write the metadata for the merged file.

   * samMetadata (deals with SAMManager generated metadata)
     methods:
        getMetadata - import a metadata file
        checkSize - check the file size against the one stored in
          the metadata
        changeAttr - change one of the metadata keys
        addAttr - add a key/value to the metadata
        addAttrList - add a key/list of values to the metadata
        addParents - add a list of parent files to the metadata
        getParents - get the list of parents
        getNumEvents - access the number of events
        getStreamName - access the stream name
        getDataTier - access the data tier
        addFile - add the metadata from another file to the
          metadata of the current file
        copyFile - copy the metadata for one file to a new file
        genMetadataFile - write a new metadata file (SAMManager
          style)

   * samStore (interface to SAM storage with error checking)
     methods:
        setGroup - set the group in SAM
        parseMetadata - try to figure the destination based on the
          metadata !!!!!!! AVOID IT !!!!!!!!!!!!
        storeFile - interface to sam store
        checkFile - check the metadata of the stored file against
          the copy on disk

   * mergeSingleFiles (merge 1 file at a time)
     methods:
        setMaxSize - set the maximum allowed size before merging
        setDontMerge - set the maximum allowed size of files which
          are not merged
        setForceMerge - set a flag for forcing the file merging even
          if the minimum required size of the merged file is not
          reached
        setStoreSAM - set a flag for storing a file into SAM
          immediately (if the file merging was successful)
        setCleanFiles - set a file triggering a cleanup of the files
          which have been merged
        setGroup - set the group in SAM
        setUserDefinedMetadata - allow the addition of user defined
          metadata
        mergeFiles - create 1 merged file, perform checks and create
          the appropriate metadata
        nameMergedFile - create the name for the merged file
        samStore - store the file in SAM

   * mergeAllFiles (run the file merging recursively on a list of
     input files, until nothing is left)
     methods:
        startSAMStoreAfterNFiles - change the number of files after
          which a SAM store command is issued
        mergeFiles - create merged files
        samStore - store the merged files in SAM

   Some things need explanation:

   1) this is for the common sample group skimming, where we merge
      randomly until we reach a certain file size..... mixing
      different runs
   2) the filenames follow a certain pattern
   3) the files to be merged together are in a single input
      directory

   Usage

   mergeFiles.py inputDirectory outputDirectory filename
       -datatier ..... -group ..... -recursive -streamname .....
       -clean -force

   where
      inputDirectory is the directory containing the files to be
        merged
      outputDirectory is the directory in which the merged files
        will be written
      filename is the common part of the filename

   for example (from the output of 1 merging job):

   Creating merged file toSAM/CSskim-1EM2JET-20040208-123527-28930588.raw_p14.06.00
   using the following input files:
   D0om initialized
   file 1: 1EM2JET/CSskim-1EM2JET-recoT_all_0000185746_mrg_001-010.raw_p14.06.00 (contains 120 events)
   file 2: 1EM2JET/CSskim-1EM2JET-recoT_all_0000185746_mrg_011-019.raw_p14.06.00 (contains 109 events)
   file 3: 1EM2JET/CSskim-1EM2JET-recoT_all_0000185746_mrg_020-028.raw_p14.06.00 (contains 96 events)
   .......

   So all the files share a common name and the recoT_all_.......
   disappears...

   the -datatier option allows overwriting the
                            "datatier = '......'" metadata
       -group               "group = '.....'"
       -streamname          "stream = '.....'"
       (actually this last one is ignored by SAM, the stream name
       is the physical_datastream_name of the parents....)

   the -recursive option is used to merge recursively the files
       present in one directory until there is nothing left or
       until the total size of the files to be merged is smaller
       than some parameter
   the -clean is used to remove the input files if the merging
       worked properly
   the -force is used to force merging files even if they don't
       reach the required size
   the -store is used to store files in SAM as soon as the file
       merging is done (or after N files).

2) storing files in SAM:

   /home/mverzocc/scripts/storeSAM1406.sh
   /home/mverzocc/scripts/newStore2.py

   usage

   /home/mverzocc/scripts/storeSAM1406.sh directory_name

   where directory_name is the name of the directory which contains
   the merged files to be stored in SAM

   description

   a) script looks into the directory for files which have a
      certain name (there is a wild card.....
      this could be easily modified)
   b) makes a list of files, checks that for each file there is
      also a "file.metadata.py" and a "file.ecrc"
   c) stores files in SAM using newStore2.py
   d) checks the output from newStore2.py and moves the files which
      had a problem during the file merging into a PROB directory

   /home/mverzocc/scripts/newStore2.py list_of_files group PNFS_location

   where
      list_of_files is the list of files to be stored in SAM
      group is the SAM group (ignored if the PNFS_location is
        provided)
      PNFS_location is the /pnfs/...... path....

   newStore.py is just a Python wrapper for the samStore() class of
   mergeUtilities.py

Marco

On Mon, 16 Feb 2004, Tibor Kurca wrote:

> Hi Marco,
> could you please point us to the scripts you are mentioning below?
>
> thanks in advance
> Tibor
>
> On Wed, 11 Feb 2004, Marco Verzocchi wrote:
>
> > Hi Mike
> >
> > since I've run into similar problems for the thumbnail fixing
> > and the skimming.....
> >
> > 5) Merging: evcopy followed by a dsdump of the merged file
> > is much faster than copyd0om and does ensure data integrity
> >
> > I have a script which does this (plus more), it needs to
> > be adapted (I have some restrictions on the file names
> > and I handle only metadata generated by SAMManager)
> >
> > Or one could use another script. On a fast disk (I was
> > using d0mino for this reason, fast disks, slow CPU)
> > evcopy takes roughly 1 minute to merge 1 GB of data,
> > then dsdump needs about 10 minutes to check the data
> > integrity.
> >
> > 6) Storing: after the dsdump step I run ecrc (the same
> > checksum calculator used by SAM/enstore) and I dump
> > the checksum in an ASCII file. I transfer the data,
> > the metadata and the checksum (if the file transfer
> > is necessary). Use the checksum on the remote end
> > to check for errors in the file transfer (seen none,
> > but this was internal to FNAL).
> > Save the file in SAM,
> > check the checksum against the one calculated by
> > enstore, if they differ the file has been corrupted
> > during the file storage. Mark it bad in SAM, change
> > the name on disk, store again.....
> >
> > The second problem has occurred several times (too
> > often ?). But end users have never seen any corrupted
> > data, and I didn't have to reprocess anything as I
> > still have the data on disk.
> >
> > While my scripts for doing this may not be easily adapted
> > for general use, I think that this is a good approach.
> >
> > Cheers
> > Marco
> >
>
> A+,
> Tibor
>
> +------------------------------------------------------------------+
> | Tibor Kurca                                                      |
> | Institute de Physique Nucleaire de Lyon email: kurca@in2p3.fr    |
> | Groupe D0                                                        |
> | 43, Bvd du 11 Novembre 1918             Tel: +33 (4)-72-44-85-01 |
> | 69622 Villeurbanne, Cedex               Fax: +33 (4)-72-43-14-52 |
> | France                                                           |
> |                                                                  |
> | Centre de Calcul-IN2P3-CNRS             Tel: +33 (4)-72-69-42-02 |
> | 12-14, Bd Niels Bohr                                             |
> +------------------------------------------------------------------+
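
Editorial sketch (not part of the original message): the size-driven
merging described for mergeAllFiles and the -recursive / -force
options can be illustrated roughly as below. This is only a guess at
the policy, written in present-day Python; the function name
plan_merges, its parameters and the (name, size) input format are
hypothetical and do not come from mergeUtilities.py, and the real
scripts may select and order files differently.

```python
def plan_merges(files, max_size, min_size, force=False):
    """Group (name, size) pairs into merge batches.

    Hypothetical illustration only: files are taken in the order
    given and a batch is closed as soon as adding the next file
    would exceed max_size. A last batch smaller than min_size is
    left unmerged unless force is True.
    """
    batches = []       # completed batches, each a list of file names
    current = []       # names collected for the batch being built
    current_size = 0   # total size of the batch being built

    for name, size in files:
        # Close the running batch if this file would overflow it.
        if current and current_size + size > max_size:
            batches.append(current)
            current, current_size = [], 0
        # A single file larger than max_size ends up alone in a batch.
        current.append(name)
        current_size += size

    leftover = []
    if current:
        if force or current_size >= min_size:
            batches.append(current)
        else:
            leftover = current  # too small to merge, wait for more input
    return batches, leftover
```

Called with four files of sizes 400, 500, 300 and 100 (arbitrary
units), max_size=1000 and min_size=500, the sketch merges the first
two files and leaves the last two for a later pass; with force=True
the last two are merged as well.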