Using SAM with the cal_entpl Package


[ Introduction | Getting Started | Samifying Your Package | Creating A Dataset | Submitting A Job ]

Introduction


The habit of putting multiple copies of many data and Monte Carlo files (raw, generated, reconstructed, etc.) on local disks should be discouraged. We need to use SAM (Sequential data Access via Meta-data) when analyzing data and MC to produce histograms, ntuples or rootuples. There is an enormous amount of information and capability in SAM, and for beginners like me that can be quite an obstacle. I carefully recorded the steps I took in successfully retrieving and analyzing a SAM file where the output was a rootuple. This information is directed to the Calorimeter Electronics group, who are in the transition from installation/commissioning to calibration, software development and physics analysis. The examples should also be useful to others, since the main difference is the package used.


Getting Started with the cal_entpl Package


Laurent Duflot developed the package, cal_entpl. The basic idea is to unpack the raw data and store the ADC counts of the electronics channels, as well as the status and PIB information of each event, in a rootuple. The smallest unit of data will be a crate (0-11). We can develop macros (which are just small C++ files) and run them in root, the result of which can be distributions of pulse heights, ADC channels, etc.

I am assuming you are doing your analysis on d0mino. If you work elsewhere, there may be subtle differences in the setups and path names, among other things. Much of this is covered in greater detail in Heidi Schellman's tutorial. I highly recommend reading it, as I only cover what is needed to set up and run a single package.

I save myself time by putting the following setups in the .login file in my home directory:

These setups add only a second or two to your login time. They will be executed automatically when you log in. If you just changed them, you can have the setups take effect now by typing:

Next, type the following command:

Now you are ready to create a working area and add a package. Type:

The choice of directory name (in this case t151) is entirely up to you. Some people prefer t1.51 or test, for example. The "-h" option gets you the most recent version from cvs. d0setwa sets up the d0 working area scripts for the D0RunII version.

You do not need to set up a new release each time you log in to d0mino. The new release is only necessary if you wish to use the latest and freshest D0 code. It is not uncommon to be using a version one or two generations behind the most recent. The same is true for adding a package. If the author of a package has informed you of a change, or if you want to make sure you are using the head version, you can obtain the updates by issuing the following command:

or you can get the exact version:


Samifying Your Package


Before working with SAM, you need to Register. Select all the working groups that apply to you (including cal).

Before you compile your package and produce an executable, you need to samify your package. For the cal_entpl package, Laurent has already done most of the work for us:

  1. He obtained the file SAMManager.rcp by the command:
      cvs checkout sam_manager/rcp
    Laurent renamed the file SAM.rcp, and it can be found at:
      {your scratch area on d0mino}/t151/cal_entpl/rcp/SAM.rcp
    I made a few modifications, and I would suggest you copy my version as listed in the above Note.
  2. He edited cal_entpl/bin/OBJECTS, adding RegSAMManager
  3. He edited cal_entpl/bin/LIBRARIES, adding sam_manager

I have included the rest of this section in case you are using some other package or creating a new one. There is a script you can use to samify your package. Type:

samify script
#!/usr/local/bin/tcsh -f
#cvs checkout sam_manager/rcp
echo "sam_manager" >> $1/bin/LIBRARIES
echo "RegSAMManager" >> $1/bin/OBJECTS
exit 0

The script appends to the OBJECTS and LIBRARIES files but does not check out sam_manager (the cvs checkout line is commented out).
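One thing to watch with the script is that running it twice appends the same entries twice. As a minimal sketch of the same two steps, here is a Python version that only adds an entry when it is missing. The function name samify and its return value are my own choices for illustration, not part of SAM or the D0 tools, and unlike the tcsh script it will create the bin directory if it does not exist.

```python
from pathlib import Path


def samify(package_dir):
    """Add the SAM entries to bin/LIBRARIES and bin/OBJECTS if missing.

    Returns the names of the files that were actually modified.
    (Hypothetical helper; mirrors the two echo lines of the tcsh script.)
    """
    additions = {
        "LIBRARIES": "sam_manager",
        "OBJECTS": "RegSAMManager",
    }
    changed = []
    for filename, entry in additions.items():
        path = Path(package_dir) / "bin" / filename
        lines = path.read_text().splitlines() if path.exists() else []
        if entry in (line.strip() for line in lines):
            continue  # already samified, leave the file untouched
        lines.append(entry)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("\n".join(lines) + "\n")
        changed.append(filename)
    return changed
```

Calling samify("cal_entpl") a second time returns an empty list and leaves both files as they are.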

Next, go to your working area directory, and type the following commands:

These commands may take a couple of minutes to run. The results will show up in the bin and lib subdirectories of your working area.

There is a framework.rcp file in the directory:

The contents will look like:

If not, I recommend that you modify your file so it does. sam has been added to the string of packages, and an RCP sam line has also been added.

If you wish to retrieve and analyze a SAM file, you need to modify the ReadEvent.rcp file. Comment out:

and add:

Don't forget the colon at the end!

Later, you may decide to run over a local copy of a raw data file (or a list of raw data files), in which case you will want to use "input" and not "SAMInput:".


Creating A Dataset


A dataset is any combination of files selected by date, file name, run number, type or data tier (raw, simulated, reconstructed), physical datastream (all, calibration, test), the person who created the dataset, and more. A dataset can consist of a single data or Monte Carlo file, or many (I created one dataset which contains over 200 raw data files).

Before browsing through the SAM database, you should really begin at the D0 Runs Database. You can search under trigger configuration name to look for a run or set of runs taken with certain types of triggers. For example:

The run number can be the most useful key word in defining and creating a dataset in SAM. When you bring up a run in the Runs Database, you can click on the button with the run number inside. This will bring up the list of triggers used in the run. Clicking it a second time will bring up the list of crates. For example, if you did a search on Run 124093, you would find that it qualifies as a global run with collisions. However, if you click through on the button twice, you will discover that the Calorimeter crates were not part of that run. Therefore, Run 124093 is not useful for doing calorimeter electronics studies. You can also try searching for runs from a particular store, or after a particular date, etc.

Now you will want to use a previously created dataset, or create one of your own in SAM. Go to the Sam Data Browsing page. If you know about a dataset created by someone, you can search in Datasets by name, date, user, etc.

If you want to create a dataset of your own, you will need to work with a set of parameters in order to narrow your search. Type the following commands:

If you are like me, you may have found the output of the first command rather cryptic. The second command lists plenty of options with explanations and examples. The --dim option is more versatile. Look at the output of sam translate constraints --dim=help:

Specify dimensions and constraints combined with and/or/minus operators as in these examples:

  --dim='file_name %ztautau% and data_tier digitized'

  --rpn='file_name %ztautau% data_tier digitized and'

  --dim='file_name %ztautau%,%ztigtig% or physical_datastream_name e+j'

  --dim='(data_tier digitized and appl_name d0reco and version preco03.07.00) minus run_number 40041'

Available dimensions (not case sensitive):

APPL_NAME : Application Name that was run on other files, resulting in the production of this file.
APPL_NAME_ANALYZED : Application Name that was run to analyze this file.
CREATE_DATE : Date the file was created.
DATASET_DEF_ID : Dataset definition id for a definition that contains this file in one of its datasets.  
     Useful with the Dataset_Version dimension.
DATASET_DEF_NAME : Dataset definition name for a definition that contains the file in one of its datasets.  
     Useful with the Dataset_Version dimension.
DATASET_ID : Numeric ID of a dataset that contains the file.
DATASET_VERSION : Version of a dataset that containts the file.  Useful when combined with either the 
     Dataset_Def_Id or Dataset_Def_Name dimensions.
DATA_FILE_LOCATION_STATUS : Status of the data file location.
DATA_FILE_NAME : Unique name of the file in SAM.  The wildcard (%) is very useful when using this dimension.
DATA_TIER : Data tier of the file.
DELIVERED_STATUS : Status of the file delivery.
EVENT_NUMBER : Event number contained within the file.
FAMILY : Application Family that was run on other files, resulting in the production of this file.
FAMILY_ANALYZED : Application Family that was run to analyze this file.
FILE_ANALYZED : Name of a data file that was analyzed to produce this file.
FILE_NAME : Unique name of the file in SAM.  The wildcard (%) is very useful when using this dimension.
FILE_STATUS : Status of the data file.
FULL_PATH : The full path of the data file, including disk or tape location.
LOGICAL_DATASTREAM_NAME : The name of the logical datastream contained in this file.
PATH : The path of the data file, excluding the file name itself.
PHYSICAL_DATASTREAM_NAME : The name of the physical datatream contained in this file.
PROJECT_NAME : The name of the project that was run to produce this file.
RUN_ID : 
RUN_NUMBER : The Run Number that created this file.
RUN_TYPE : The Run Type of the run that created this file.
RUN_TYPE_ID : The numeric ID of the Run Type that created this file.
TAPE_LABEL : The label on the tape that contains this file.
VERSION : The Application Version that was run on other files, resulting in the production of this file.
VERSION_ANALYZED : Application Name that was run to analyze this file.
__SET__ : Special dimension allowing you to query all files that match another dataset definition name.  
     This is useful for combining with union/and/or operators on your own set of dimensions.  Simply use 
     __SET__ as your dimension name and the name of your existing definition as the constraint value, e.g. 

     --dim='file_name %ztautau% minus __set__ my-files-already-analyzed'

The dimension __SET__ is a special dimension which lets you combine prior dataset definitions into your 
new dataset definition, simply use __SET__ as your dimension name and the name of the existing
dataset definition as the constraint value, e.g.

     --dim='file_name %ztautau% minus __set__ my-files-already-analyzed'

For additional information on Dimension Names, Constraint Operators and Set Operators, go to the page on SAM Dataset Definition Grammar.
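To make the relation between the --dim and --rpn forms in the help text concrete, here is a small sketch that builds both strings from the same list of constraints. The helper names build_dim and build_rpn are hypothetical and have nothing to do with the SAM tools themselves. I am assuming that --rpn is ordinary postfix notation, which is what the first two help examples suggest. The sketch only chains constraints left to right; the help text does not say how mixed operators are grouped without parentheses, so add parentheses by hand as in the fourth example when it matters.

```python
SET_OPERATORS = ("and", "or", "minus")


def _check(op):
    if op not in SET_OPERATORS:
        raise ValueError(f"unknown set operator: {op!r}")


def build_dim(first, *rest):
    """Infix form: 'dim value op dim value ...'.

    `first` is a (dimension, value) pair; each item in `rest` is an
    (operator, dimension, value) triple.
    """
    dim, value = first
    parts = [f"{dim} {value}"]
    for op, dim, value in rest:
        _check(op)
        parts.append(f"{op} {dim} {value}")
    return " ".join(parts)


def build_rpn(first, *rest):
    """Postfix form of the same chain: each operator follows its operand."""
    dim, value = first
    parts = [f"{dim} {value}"]
    for op, dim, value in rest:
        _check(op)
        parts.append(f"{dim} {value} {op}")
    return " ".join(parts)


constraints = [("file_name", "%ztautau%"), ("and", "data_tier", "digitized")]
print(build_dim(*constraints))  # file_name %ztautau% and data_tier digitized
print(build_rpn(*constraints))  # file_name %ztautau% data_tier digitized and
```

The two printed strings are the first two examples from the help output.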

If you want to look at recent raw data from a global run, you could type:

The output would be:

Other options for physical_datastream_name are store_1x8, cosmics, daq_test and calibration. As for data_tier, you can find more options on this SAM Query Page. The full sam translate constraints --dim='...' command must be typed on one continuous line.
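If you keep your queries in a notes file spread over several lines, a few lines of Python can collapse one into the single-line, single-quoted form the command expects. This is only a sketch: translate_command is a made-up helper, the run number in the example is just an illustration, and the quoting follows POSIX shell rules via shlex.quote, which I am assuming is close enough for simple queries like these in tcsh.

```python
import shlex


def translate_command(dim):
    """Return the one-line 'sam translate constraints' command for a query.

    Whitespace and line breaks inside `dim` are collapsed to single spaces.
    (Hypothetical helper; it only builds the string, it does not run SAM.)
    """
    one_line = " ".join(dim.split())
    return "sam translate constraints --dim=" + shlex.quote(one_line)


query = """
    data_tier raw
    and run_number 124236
"""
print(translate_command(query))
# sam translate constraints --dim='data_tier raw and run_number 124236'
```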

Attention:  The physical datastream name is just a part of the file name in SAM. The all stream is currently synonymous with 36x36 collisions, i.e. real physics. However, store_1x8 has accidentally been used in the names of files recorded during proton and pbar halo runs. It is better to search the Runs Database for particular trigger configurations, and take that information to SAM when creating a dataset.

Now you are ready to Define Your Dataset. Taking the example from above:

The --defname is up to you, but try to make the name meaningful as others may want to use it. When you registered with SAM, you selected from a list of Working Groups. I would recommend using dzero or cal when working with the cal_entpl package. The --dim option is set between single quotes.

You should practice using the Sam Data Browsing page. Search for your Dataset Definition with key words like:

The next step is to Create Your Dataset. The option --snapdesc is set to a brief description between double quotes. After executing this command, SAM will return a dataset ID. Now check the Sam Data Browsing page and search for your Dataset:

There is an alternative and, in my opinion, easier way to create a dataset in SAM. The Dataset Definition Editor interface allows you to create a new dataset or clone a previously defined dataset. As an example of the latter, you can click on Person, then the user name alstone. There you will find several datasets. Click on raw_run_124110, and you will see the following in your browser:

You cannot edit this dataset, but if you click on the clone button, a copy of the dataset is produced. You can then edit the clone to fit your needs. The clone will look like:

You can change the name, the group, user and/or dimension query. If you do change the dimension fields, you can always click on the translate button, and an updated dataset will appear in the window at the bottom of the page. Once you are satisfied, you can save the cloned dataset, which is the same as defining a new one. Make sure you are happy with the name before saving it. The default name just adds a Clone- prefix to the original definition name.

Wyatt Merritt put together a quick tutorial on the Dataset Definition Editor with more details than I gave above, particularly in the case of starting at the beginning with a New Dataset.

Please Note:   SAM does not guarantee that the files in your dataset will be delivered in a particular order. If you need to work with only a single file, or you want to repeat your analysis on the same set of datafiles with a new version of your code, you should not rely on SAM to provide the files in the order and combination in which you need them. A python script exists that people have used to cache files so they could analyze SAM files in a particular group or sequence. However, the sam run project command is being deprecated, and should no longer be used with the python script.
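One way to live with the unordered delivery is to make your own bookkeeping independent of it: key every result by file name and only sort at the very end. The sketch below shows the idea in plain Python. Both summarize and the analyze callback are hypothetical stand-ins for whatever your job does per file, and nothing here talks to SAM.

```python
def summarize(delivered_files, analyze):
    """Analyze files in whatever order they arrive; report in sorted order.

    `delivered_files` is any iterable of file names, `analyze` is a function
    called once per distinct file. (Illustrative only, not a SAM interface.)
    """
    results = {}
    for name in delivered_files:
        if name in results:
            continue  # ignore a repeated delivery of the same file
        results[name] = analyze(name)
    return [(name, results[name]) for name in sorted(results)]
```

With this arrangement two passes over the same dataset give the same summary even if the files come back in a different order the second time.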

You can search for files and/or datasets to find out whether particular files are already cached on d0mino. Cached files should stay in cache indefinitely, unless the disk quota for that working group has been reached. If that occurs, the oldest unused cached files will be deleted to make room for new requests. You can check the status of the project and disk quotas by typing:

    sam dump station --projects

If there is a concern that the cached files will get deleted before your analysis is completed, you may want to place a copy of the files in your favorite scratch area as a temporary solution. In doing this, you have basically circumvented SAM. Before running your executable, you will want to return the ReadEvent.rcp and framework.rcp files to an "un-Samified" state (see above to reverse the samify instructions).

You are now ready to retrieve and process a SAM file with your package.


Submitting A Job


You have basically two options for submitting a SAM job: interactive or batch. If you are testing your code and/or running over a small number of events, interactive is the way to go. On d0mino, you are limited to a maximum of 60 minutes for a single interactive job. If you think your job will take longer than an hour, you should submit to batch. Check out the d0mino Batch usage page. As of June 27, 2001, the SAM batch usage changed. There are now two queues, sam_lo and sam_hi. However, the user has no choice in the matter, so don't try to specify the queue. SAM decides whether the job is "high" or "low" priority. The other batch parameters can still be used.
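If you have a rough idea of how long your executable spends per event, the interactive-or-batch choice is simple arithmetic against the 60 minute limit. The sketch below is only that arithmetic. The function name, the safety factor of 1.5 and the 0.1 seconds per event used in the example are my own assumptions, so time a short test run of your own job before trusting any number.

```python
INTERACTIVE_LIMIT_MIN = 60  # d0mino limit for a single interactive job


def choose_mode(n_events, seconds_per_event, safety_factor=1.5):
    """Return 'interactive' or 'batch' from a rough run-time estimate.

    The estimate is padded by `safety_factor` (an assumed margin) so that
    a job close to the limit goes to batch.
    """
    est_minutes = n_events * seconds_per_event * safety_factor / 60.0
    return "interactive" if est_minutes < INTERACTIVE_LIMIT_MIN else "batch"


print(choose_mode(5130, 0.1))   # about 13 minutes padded -> interactive
print(choose_mode(50000, 0.1))  # about 125 minutes padded -> batch
```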

I have modified some scripts that you can use to submit an interactive or batch SAM job. Only the essential options and parameters are provided, but the scripts are complete and get the job done.

The only parameter you are likely to change from job to job is SAM_DATASET. If you use a different executable, you will need to change EXEC. The GROUP value of cal is fine unless we reach some group or disk limit, in which case you may try dzero. To submit your batch job, type:

You should see something that resembles the following:

      
    [d0mino]< alstone > ./sam_job.sh
    SAM_PROJECT raw_run_124236_v1_06_28_01_02_22
    SAM_DATASET raw_run_124236_v1
    GROUP cal
    SNAP_VERSION last
    EXEC bin/IRIX6-KCC_3_4/CalElecNtupleMaker
    FRAME_RCP -rcp framework.rcp -num_events 50000
    BATCH_JOB -N -o my_job_log
    >>>>>> Starting project with the Station Master
    Station Master contacted, result: Started project 29462(raw_run_124236_v1_06_28_01_02_22) for group cal
    Waiting for the project to initialize...
    Callback from server: 'OK|Project is ready'
    >>>>>> Submitting the job to the batch system
    >>>>>> Executing: bsub -P raw_run_124236_v1_06_28_01_02_22 -N -o my_job_log -q sam_lo 
    /usr/products/sam_user/IRIX-6-5/v3_1_3/bin/samscript.sh 
    framework_wrapper.sh raw_run_124236_v1_06_28_01_02_22 central-analysis bin/IRIX6-KCC_3_4/CalElecNtupleMaker 
    -rcp framework.rcp -num_events 50000 
    Job < 16037 > is submitted to queue < sam_lo >.
    
    [d0mino]< alstone >
    

I did a command line override for the Number of Events with -num_events 50000. An output log file is created for the job.
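If you submit many jobs, it helps to pull the project name, project ID and batch job number out of the submission output so you can look them up later. Here is a minimal sketch that does this with regular expressions written against the output shown above. parse_submission is a hypothetical helper, and the patterns are only as good as that one sample; a different version of the SAM scripts may word these lines differently.

```python
import re

# Patterns written against the sample submission output; treat as assumptions.
PROJECT_RE = re.compile(r"Started project (\d+)\(([^)]+)\) for group (\w+)")
JOB_RE = re.compile(r"Job\s*<\s*(\d+)\s*>\s*is submitted to queue\s*<\s*(\w+)\s*>")


def parse_submission(log_text):
    """Extract project and batch job details from the submission output.

    Returns a dict containing only the fields that were found.
    """
    info = {}
    match = PROJECT_RE.search(log_text)
    if match:
        info["project_id"] = int(match.group(1))
        info["project_name"] = match.group(2)
        info["group"] = match.group(3)
    match = JOB_RE.search(log_text)
    if match:
        info["job_id"] = int(match.group(1))
        info["queue"] = match.group(2)
    return info
```

Fed the output above, this yields project 29462 (raw_run_124236_v1_06_28_01_02_22) in group cal, and job 16037 in queue sam_lo.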

I saved the job log from a batch job which processed 5130 events successfully from Run 124038.

To follow the progress of your run, you can check on the status of your project, or go to the command line and type:

Also review the d0mino Batch usage for other batch commands.



Last modified: Mon Jul 16 22:30:41 CDT 2001
Web page maintained by Alan L. Stone: alstone@fnal.gov