INSTRUCTIONS FOR RUNNING RECOCERT PRODUCTION

Meenakshi Narain
Oct 25th, 2008
Updated Dec 14, 2008
Updated Jan 1st, 2009
Updated March 29, 2009
Updated December 17, 2009

RECOCERT STATUS WEBPAGE:
http://www-d0.fnal.gov/computing/recocert/STATUS/current.html

========

RUNNING RECOCERT:

   cd /prj_root/2665/quality_data/recocert_Oct2008/RECOCERT/recocert_utils

1) Stage 1 - get runs and submit jobs:
--------------------------------------

What is done in this step:
   - Make the runs list
   - Generate Control file
   - Make SAM dataset defs
   - Submit recocert jobs
   - Update the recocert date/version processed list
   - Update webpage

Which Commands to execute:

   kinit -f -r 7d
   source setup.this
   ./uber_RecoCert.sh

Cross Checks and background information:

   - Check the files (list of runs to be processed, and control files):
        ../../makelist/RUNLISTS/RecocertNeed.*.
        ../CONTROLS/Control..
        ../CONTROLS/Control.Done..
        ../CONTROLS/RECOCERT_PROCESS_DATES.txt
           (this file is used by the webpage maker to include all the
            files processed).

   To check that all jobs have been submitted without problems, make
   sure that the number of lines printed on the screen is the same as
   the number of lines in ../CONTROLS/Control..
   Also make sure that the two commands below give the same number of
   lines. This means that all jobs have been submitted.

        wc -l ../CONTROLS/Control..
        wc -l ../CONTROLS/Control.Done..

   If the two files are not the same length, rerun the
   submit_runRecoCert command:

        ./submit_runRecoCert.sh
        e.g. ./submit_runRecoCert.sh 20081024

   If you run submit_runRecoCert.sh by hand, you must also:

        cp Control.Done.${RECO_VERSION}. ${CONTROL_SAVE_DIR}/

   to get those up to date.

2) Step 2: the wait stage
---------------------------

This is the waiting stage while jobs are running. They can take
somewhere between 2 and 4 hours of CPU time.

After a while, check whether jobs are completed, held, or stuck.

   - If no jobs appear using the command
        qstat -u @d0cabsrv1
     (e.g. I do qstat -u meena @d0cabsrv1 )
     then go to step 3.
        qstat recocert@d0cabsrv1 | grep
     is similar but shows the cpu time per job rather than wall clock
     time.
        qstat -Q @d0cabsrv1
     gives a quick overview of all the queues on d0cabsrv1.
        qstat -Q recocert@d0cabsrv1
     gives a summary of the recocert queue only.

   - If jobs are held or stuck, delete them using qdel (or
     stop_jobs.sh). On d0mino0x you can also try purge_job . The latter
     will kill some jobs that pbs has lost track of. As a last resort
     send a message to helpdesk@fnal.gov asking them to delete jobs
     that neither of the procedures above can kill.

        qdel
        e.g. qdel 123456.d0cabsrv1.fnal.gov

     NOTE: you must include the entire .fnal.gov part of the name.

     stop_jobs.sh runs qdel on a list of jobs created by:
        qstat -u ${USER} @d0cabsrv1 | grep recocert > stop_jobs.list
     or better
        qstat recocert@d0cabsrv1 | grep ${USER} > stop_jobs.list
     The former qstat shows you the elapsed time per job, the latter
     the cpu time per job, so you can add a grep for something like
     00:00: to find stuck jobs.

     If the qdel doesn't work, log onto one of the d0mino machines and
     run purge_job
        e.g. purge_job 123456.d0cabsrv1.fnal.gov

Once all jobs are done running, go to the next step.

3) Step 3: check jobs and resubmit
--------------------------------------

What is done in this step:
   - Check if all recocert jobs completed
   - If not, make a list of jobs to be rerun
   - Clean up bad recocert files
   - Move Summary files to the correct location
   - Copy good root and metadata files over to the BUFFER area
   - If this is the 2nd recovery check, the failed logs are also copied
   - Submit the recocert jobs to be rerun
   - Make control files and copy them over to the correct areas
   - Update webpage

Which Commands to execute:

   ./checkRecocertJOBS.sh
   e.g. ./checkRecocertJOBS.sh 20081024

   The argument *HAS* to be the same one which was created in Stage 1.

****NOTE***
     Look to see if a file runs.ToRestartTake2.. is created.
     This file is created only if the job died due to the dataset
     definition being deleted. This happens when the job is pending for
     a long time.
     In this case, one needs to take special action: remake the dataset
     definition, the control files etc. and rerun.
     OR alternatively, just remove the runs by editing the following
     files to remove any occurrences; they will then be part of the
     next cycle of recocert:
        ./Control..
        ./Control.Done..
        ../CONTROLS/Control..
        ../CONTROLS/Control.Done..
        ../../makelist/RUNLISTS/runs.ToProcess.*.
***********

Which files are created and background information:

        ../CONTROLS/RecoverControl..
        ../CONTROLS/RecoverControl.Done..

   The script above (checkRecocertJOBS) also does the following:
   - moves good root and metadata files to ./../BUFFER/
   - copies Summary files (RecoCert___.Summary) to ../SUMMARY/
   - moves log files for jobs with failures to ../FAILED_LOGS/
     (this is done only after the second recovery step).

   To check that all jobs have been submitted and no problems
   occurred, make sure that the number of lines in the following files
   is the same:

        wc -l ../CONTROLS/RecoverControl..
        wc -l ../CONTROLS/RecoverControl.Done..

   If the two files are not the same length, rerun the
   submit_runRecoCert command (note there are 2 arguments, as opposed
   to 1 earlier in step 1):

        ./submit_runRecoCert.sh Recover
        e.g. ./submit_runRecoCert.sh 20081024 Recover

4) Step 4: the wait stage
---------------------------

Again, same as step 2: wait for jobs to finish, or delete them if they
are stuck.

5) Check the jobs which were rerun
-------------------------------------

What is done in this step:
   Same as step 3, but it looks at the list of files which were to be
   recovered.

Which Commands to execute:

   ./checkRecocertJOBS.sh Recover
   e.g. ./checkRecocertJOBS.sh 20081024 Recover

Which files are created and background information:
   creates ../CONTROLS/RecoverTake2Control.*.

   Now we assume that the files in RecoverTake2Control.*. have been
   run twice and the errors are real, so we declare these files "bad"
   and also declare the cycle done.
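
The line-count cross check used in steps 1 and 3 (a control file against
its matching Done file) can be wrapped in a small helper. This is only a
sketch: the function name same_line_count is hypothetical and is not one
of the recocert scripts, and it assumes that one line in each file
corresponds to one job, as the wc -l comparison above implies.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the recocert scripts.
# Compares the number of lines in a control file ($1) with the number
# of lines in its matching Done file ($2).
# Returns 0 if the counts agree, 1 if they differ, 2 if a file is unreadable.
same_line_count() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "ERROR: cannot read $1 or $2"
        return 2
    fi
    # $(( ... )) strips any padding that some wc implementations add
    n_ctl=$(( $(wc -l < "$1") ))
    n_done=$(( $(wc -l < "$2") ))
    if [ "$n_ctl" -eq "$n_done" ]; then
        echo "OK: $n_ctl lines in both files"
        return 0
    fi
    echo "MISMATCH: $n_ctl lines in $1, $n_done lines in $2"
    return 1
}
```

After sourcing this, call same_line_count with the same two files that
are given to wc -l above. A MISMATCH result is the case in which the
instructions say to rerun submit_runRecoCert.sh.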

6) Check LBNs
---------------

What is done in this step:
   Next we need to find out whether ALL LBNs for a particular run are
   included in the recocert files or not.
   The webpage is also updated.

Which Commands to execute:

   cd /prj_root/2665/quality_data/recocert_Oct2008/CHECKLBNS
   source ../RECOCERT/recocert_utils/setup.this
   setup python_dcoracle
   ./checkLBN.sh

Background Information and files created:
   Reports which runs have missing LBNs and the fraction lost.
   Creates a log: ./LOGS/checkLBN_Summary.*.
      e.g. checkLBN_Summary.p20.12.05.20081019

   There are 5 columns in this summary file:
      run#
      #lbns-in-tmb
      #lbs-in-recocert
      %diff-tmb-recocert
      %percent-loss
   The %diff-tmb-recocert should be 0 for all DONE.

   Also, a file diff* is created in the LBN_SUMMARY subdir for those
   runs with missing LBNs. This file includes the list of missing LBNs.

7) Store files in SAM
--------------------------------

What is done in this step:
   Once the file statuses are determined and the LBNs verified, store
   these files in SAM.

Which Commands to execute:

   *IMPORTANT* LOGIN to d0srv069.fnal.gov
   This step cannot be done from any other node.
   (You cannot connect from offsite; you need to ssh to clued0 and then
   onwards from there.)

   cd /prj_root/2665/quality_data/recocert_Oct2008/SAMStorage
   source /fnal/ups/etc/setups.csh
   setup sam
   nohup ./submit_SAM.sh &

Command to execute after some wait, say 2 hours:
   You can check the total count of files to be transferred by
      wc -l files.p20.12.05.20081019
   and the current status by
      wc -l out.p20.12.05.20081019.log
   If the two counts above are the same, then the copy to SAM is done
   and one can update the webpage.

   The next command needs to be run from a machine other than d0srv069:
      ../make_web_page.sh

   UPDATE OF WEBPAGE IS IMPORTANT AT THE END OF THE STEP, AS THIS IS
   THE FILE WHICH IS USED TO DETERMINE THE NEXT SET OF RUNS TO PROCESS.
   http://www-d0.fnal.gov/computing/recocert/STATUS/current.html

Background Information and files created:
   Also check the job log file, though this is verbose/long.
   It is best to check for occurrences of failures and "already exists":
      egrep -c "fail|already" SAMStore_LOG_20081019
   (note that you should get zero results. If not, do the two
   individually to determine why.)
   and the successes:
      egrep -c success SAMStore_LOG_20081019
   (note that you should get a number equal to the number of files to
   be transferred:
      wc -l /prj_root/2665/quality_data/recocert_Oct2008/SAMStorage/files.list )

   In principle one can do the following as well to run the job:
      nohup ./StoreInSAM.sh >&SAMStore_LOG_ &

8) Move all *Control* files in
   /prj_root/2665/quality_data/recocert_Oct2008/RECOCERT/recocert_utils
   to the ../temp subdir; they will be deleted later.

=====================

INSTRUCTIONS ON HOW TO CHANGE RECO VERSIONS:

A) In case the reco version of *INPUT TMB* changes:

   1) Edit the file
         /prj_root/2665/quality_data/recocert_Oct2008/RECOCERT/recocert_utils/defs
      and change the following two lines:
         VERSION=p20.12.05b
         RECO_VERSION=p20.12.05b

   2) Update the LBN check script:
         /prj_root/2665/quality_data/recocert_Oct2008/CHECKLBNS/PrintLBNDir.C
      Change:
         string infilename = "../BUFFER/"+proc_date+"/cert_"+runnumber+"_p20.12.05b_"+jobnumber+".root";
      to reflect the correct version of the input TMB file.

=============================

B) To reinstall a new version of the recocert analysis package do the
   following:

   cd /prj_root/2665/quality_data/recocert_Oct2008/RECOCERT/recocert_utils/

   1) mv recocert recocert__BEFORE_

   2) Edit the file "setup.this" and change the reco version number.
      For example
         setup D0RunII p20.12.05b -O SRT_QUAL=maxopt
      should be changed to
         setup D0RunII p20.12.06 -O SRT_QUAL=maxopt

   3) Execute the command:
         source setup.this

   4) Get the correct version of recocert/rcp:
         setup d0cvs
         cvs co recocert/rcp

   5) If the input TMB file version also needs to change, then edit
      the "defs" and PrintLBNDir.C files as indicated in section A.

========================

Note that the INPUT tmbs could be different in reco version compared
to the recocert analysis package.
For example, p20.12.06 recocert could be used to analyze p20.12.05c
input TMBs.

=============================================================
The scripts above were based on scripts developed by
Alan Jonckheere and Supriya Jain in Aug 2008
=================================================================