Farm Dataset Audit Information
The audit file contains information on the completion status of physics
datasets processed on the production farm. It is updated daily at
approximately 01:00 AM. It reports the completion status of unmerged
thumbnails, merged thumbnails, and RecoCert files. The columns in the
file are described below.
Project Definition Name
The project definition name listed in column 1 is the name of the
project definition which selects the raw data files from SAM.
The definition names have the following format:
dayset-yyyy-mm-dd-stream-run-index
where
dayset = Just an identifier for the farm projects; it means nothing
yyyy-mm-dd = Date on which the first partition of raw data in this stream was written to SAM
stream = Physical datastream name of the data. For example: all_1, all_2, etc.
run = Run number
index = First digit of the 3-digit partition numbers of files in the set
For example:
dayset-2007-12-03-all_4-238290-0
Contains raw data partitions 000-099 of stream all_4 in run 238290
dayset-2007-12-03-all_3-238290-1
Contains raw data partitions 100-199 of stream all_3 in run 238290
dayset-2007-12-03-all_3-238290-2
Contains raw data partitions 200-299 of stream all_3 in run 238290
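The naming convention above can be sketched as a small parser. This is only an illustration: the names parse_defname and DaysetName are hypothetical, and the pattern assumes the run number is all digits and the index is a single digit, as in the examples.

```python
import re
from typing import NamedTuple

# dayset-yyyy-mm-dd-stream-run-index
# Assumption: run is all digits and index is a single digit (0-9).
_DEFNAME_RE = re.compile(r"^dayset-(\d{4}-\d{2}-\d{2})-(.+)-(\d+)-(\d)$")


class DaysetName(NamedTuple):
    date: str             # yyyy-mm-dd of the first raw partition in the stream
    stream: str           # physical datastream name, e.g. all_4
    run: int              # run number
    index: int            # first digit of the 3-digit partition numbers
    first_partition: int  # lowest partition number covered by the set
    last_partition: int   # highest partition number covered by the set


def parse_defname(name: str) -> DaysetName:
    """Split a project definition name into its documented fields."""
    match = _DEFNAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a dayset definition name: {name!r}")
    date, stream, run, index = match.groups()
    idx = int(index)
    # Index n covers partitions n00 through n99.
    return DaysetName(date, stream, int(run), idx, idx * 100, idx * 100 + 99)
```

For example, parse_defname("dayset-2007-12-03-all_3-238290-1") gives stream all_3, run 238290, and partitions 100 through 199.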
Note that not all streams from a given run necessarily start on the
same day. However, all partitions in a given stream and run have the
same starting date. The actual definitions are of the following form:
% sam describe definition --defname=dayset-2007-12-03-all_4-238290-0
Definition Name : dayset-2007-12-03-all_4-238290-0
Definition Id   : 1084378
Description     : raw data run 238290 which started on 12/03/2007
Creation date   : 05-Dec-2007 06:00:00 (UTC)
User name       : diesburg
Group name      : d0production
Dimensions      : ((((DATA_TIER raw and TRIG_CONFIG_TYPE physics) and
                  PHYSICAL_DATASTREAM_NAME all_4) and RUN_NUMBER 238290) and
                  FILE_PARTITION 000-099)
Raw Files, Events, KB/Ev
Columns 2-4 display information about the raw data selected by the
project. Column 2 is the number of raw files selected by the project
definition. Column 3 is the total number of events in all files in the
project. Column 4 is the average size of the events in the project in
KBytes, calculated as the total file size reported by SAM divided by
the total event count. The data for the raw files is extracted from SAM
with a command like:
% sam translate constraints --dim="__set__ dayset-2007-12-03-all_4-238290-0" --summaryOnly
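The KB/Ev calculation for column 4 can be sketched as follows. The function name kb_per_event is hypothetical, and the sketch assumes the total size is given in bytes with 1 KByte = 1024 bytes; neither the units SAM reports nor the conversion factor is stated here.

```python
def kb_per_event(total_size_bytes: int, total_events: int) -> float:
    """Average event size in KBytes: total file size over total event count.

    Assumption: the size is in bytes and 1 KByte = 1024 bytes.
    """
    if total_events <= 0:
        # No events selected, so there is no meaningful average.
        return 0.0
    return total_size_bytes / 1024.0 / total_events
```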
Unmerged Files, Events, %Comp
Columns 5-7 display information about
unmerged thumbnails produced from the raw data in columns
2-3. Column 5 is the number of raw files which have a
descendant in SAM in the unmerged-thumbnail data tier that was produced
by the d0reco application of the appropriate version.
Column 6 is the total number of raw events which have a descendant in
SAM in the unmerged-thumbnail data tier that was produced by the d0reco
application of the appropriate version. Column 7 is
the percentage by event count of the raw data which has an unmerged
thumbnail in SAM.
Note that these numbers are determined by counting
constrained raw data, not by counting unmerged thumbnails and events
directly. The data for these counts is extracted from
SAM with a command like:
% sam translate constraints --dim="__set__ dayset-2007-12-03-all_4-238290-0 and file_analyzed > 0 and appl_name_analyzed d0reco and data_tier_analyzed unmerged-thumbnail and version_analyzed p20.11.01" --summaryOnly
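As an illustration, the constraint string used above and the %Comp value in column 7 could be produced as shown below. The helper names unmerged_dims and percent_complete are hypothetical; the dimension string simply reproduces the one in the example command with the definition name and version substituted.

```python
def unmerged_dims(defname: str, version: str) -> str:
    """Build the --dim string that selects raw files with an unmerged
    thumbnail descendant produced by d0reco at the given version."""
    return (
        f"__set__ {defname} and file_analyzed > 0 "
        f"and appl_name_analyzed d0reco "
        f"and data_tier_analyzed unmerged-thumbnail "
        f"and version_analyzed {version}"
    )


def percent_complete(stage_events: int, raw_events: int) -> float:
    """Percentage, by event count, of the raw data present at a later stage."""
    if raw_events <= 0:
        return 0.0
    return 100.0 * stage_events / raw_events
```

The same percent_complete helper applies to any of the %Comp columns, since each is defined by event count relative to the raw data.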
Merged Files, Events, %Comp, KB/Ev
Columns 8-11 display information about the merged thumbnails produced
from the unmerged data in columns 5-7. Statistics for the merged files
are obtained by examining the merged files themselves rather than by
constraining the raw file selection. The list of merged files which
match the input raw dataset is first selected with a command like:
% sam translate constraints --dim="data_tier thumbnail and appl_name d0reco and run_number 238290 and physical_datastream_name all_3 and file_name recoT_%_mrg_0% and version p20.11.01"
The list of merged files is looped over and the
number of parents of the merged files is totaled to arrive at the file
count listed in column 8. The event counts of the merged files
are totaled to arrive at the event count in column 9.
Column 10 is the percentage by event count of the raw data which has a
merged thumbnail in SAM. Column 11 is the size per
event of the merged thumbnail events as determined by the total file
size reported by SAM divided by the total event count for the merged
files.
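The loop over merged files described above can be sketched like this. The MergedFile record and the merged_columns function are hypothetical stand-ins for whatever per-file metadata is retrieved from SAM, and the size is again assumed to be in bytes with 1 KByte = 1024 bytes.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class MergedFile:
    name: str          # merged thumbnail file name
    parent_count: int  # number of parents of this merged file
    event_count: int   # events in this merged file
    size_bytes: int    # file size (assumed to be in bytes)


def merged_columns(files: Iterable[MergedFile],
                   raw_events: int) -> Tuple[int, int, float, float]:
    """Return (files, events, %Comp, KB/Ev) for columns 8-11."""
    n_files = n_events = n_bytes = 0
    for f in files:
        n_files += f.parent_count   # column 8 totals parents, not merged files
        n_events += f.event_count   # column 9
        n_bytes += f.size_bytes
    pct = 100.0 * n_events / raw_events if raw_events > 0 else 0.0
    kb_ev = n_bytes / 1024.0 / n_events if n_events > 0 else 0.0
    return n_files, n_events, pct, kb_ev
```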
RecoCert Files, Events, %Comp
Columns 12-14 display information about the RecoCert
files produced from the merged data in columns 8-11. Each
merged file of a given dataset is checked to see if it has a descendant
with a file name of the form
"cert_mergedfilename.root". The parentage counts of the
merged files which have such a descendant are totaled to arrive at
the file count in column 12. The event counts of the descendant
files are totaled to arrive at the event count in column
13. Column 14 is the percentage by event count of raw
data which have entries in RecoCert files.
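A rough sketch of the RecoCert bookkeeping follows. The function names and input shapes are hypothetical, and the sketch assumes the descendant name is formed by wrapping the merged file name exactly as given in "cert_" and ".root".

```python
from typing import Dict, Iterable, Tuple


def cert_name(merged_file_name: str) -> str:
    """Expected RecoCert file name for a merged file (assumed form)."""
    return f"cert_{merged_file_name}.root"


def recocert_columns(merged: Iterable[Tuple[str, int]],
                     cert_events: Dict[str, int],
                     raw_events: int) -> Tuple[int, int, float]:
    """Return (files, events, %Comp) for columns 12-14.

    merged      -- (merged file name, parent count) pairs
    cert_events -- event count of each RecoCert descendant found, by name
    """
    n_files = n_events = 0
    for name, parent_count in merged:
        cert = cert_name(name)
        if cert in cert_events:
            n_files += parent_count       # column 12: parents of merged file
            n_events += cert_events[cert]  # column 13: events in descendant
    pct = 100.0 * n_events / raw_events if raw_events > 0 else 0.0
    return n_files, n_events, pct
```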
Delta Files, Events
Columns 15-20 contain the number of missing files and events for
unmerged, merged, and RecoCert data, in that order. Note that these
numbers carry no additional consistency information. They are simply
differences calculated from the preceding columns.
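The delta columns can be illustrated with the sketch below. It assumes each difference is taken relative to the raw counts in columns 2-3 and that the columns are ordered as (files, events) pairs per stage; the text above does not spell out either detail, and the function name is hypothetical.

```python
from typing import Tuple


def delta_columns(raw: Tuple[int, int],
                  unmerged: Tuple[int, int],
                  merged: Tuple[int, int],
                  recocert: Tuple[int, int]) -> Tuple[int, ...]:
    """Return six values: missing (files, events) for unmerged, merged,
    and RecoCert data. Each input is a (files, events) pair.

    Assumption: every difference is taken against the raw counts.
    """
    raw_files, raw_events = raw
    deltas = []
    for files, events in (unmerged, merged, recocert):
        deltas.append(raw_files - files)
        deltas.append(raw_events - events)
    return tuple(deltas)
```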
Status
Column 21 contains the completion status of the
dataset. The completion status is determined by comparing the
raw, unmerged, and merged file counts. The RecoCert counts
are not considered in setting the status. The
possible values of the status are:
COMPLETE = No further processing to do. All raw files in the dataset were successfully processed and merged and the merged files stored in SAM.
FINISHED = No further processing to do. Some raw files failed reconstruction and cannot be recovered.
ACTIVE = Processing still in progress or not yet started.
RECHECK = Status to be rechecked at the next audit update.
The status of "FINISHED" cannot be unequivocally determined from the
file counts. If a dataset is merged before all files have been
reconstructed, then it may be incorrectly flagged as "FINISHED".
"FINISHED" datasets will be periodically reviewed to ensure their
status is correct.
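The exact comparison rule is not given here, so the following is only one plausible reading of how a status could be derived from the three file counts. It is not the audit's actual logic, and the function name guess_status is hypothetical. The reading treats a dataset as FINISHED when every unmerged file produced so far has been merged but some raw files are still unaccounted for, which also reproduces the mislabeling case described above. Because the conditions for RECHECK are not stated, the sketch never returns it.

```python
def guess_status(raw_files: int, unmerged_files: int, merged_files: int) -> str:
    """Guess a completion status from file counts (illustrative only).

    Assumption: counts are in units of raw files, as in columns 2, 5 and 8.
    """
    if raw_files > 0 and merged_files == raw_files:
        # Every raw file has been reconstructed and merged.
        return "COMPLETE"
    if unmerged_files > 0 and merged_files == unmerged_files:
        # Everything reconstructed so far is merged, but some raw files are
        # missing. This cannot distinguish unrecoverable failures from a
        # dataset that was merged early.
        return "FINISHED"
    return "ACTIVE"
```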