This is a preliminary (and iterative) attempt at describing wDSP error
treatment and reporting by the SLICs. Most importantly, it contains the
guidelines for error treatment by code developers.
Comments/corrections/additions welcome.
                                                        Arthur
----------------------------------------------------
http://www-d0.fnal.gov/~maciel/l2alg/wDSP_errors.txt
----------------------------------------------------
last update: August 8, 2001


Section-1  GENERAL REMARKS
           ~~~~~~~~~~~~~~~
By framework we mean the portion of the code that is the same for all
SLICs -- largely the code that configures and drives the DSPs.
By algorithms we mean the portion of the code that is DSP specific,
namely the "Unpackers" and the "Algorithms" in the case of worker DSPs.

Worker DSP (wDSP) errors are reported event by event in the output
trailer word, which is an error mask whose bits indicate the various
error types. Error masks are then accumulated by DSP5 into histograms
for statistical monitoring. These error histos, together with timing
and occupancy histos, are normally accumulated by DSP5, who will send
this monitoring info out upon a system's request.

wDSP errors occur in two "types":
   Event (simple) errors,
   Fatal errors
and in four "flavors" -- see below.


Section-2  ERROR "TYPES"
           ~~~~~~~~~~~~~
Event (simple) errors;
   An error from which it is possible to recover (typically by
   neglecting data and proceeding normally) will be referred to as an
   event error. Event errors only set bits in the event error mask,
   and are propagated all the way to tape, so that the analysis knows
   that muon-L2 is possibly corrupted, "analyze at your own risk".

Fatal errors;
   Fatal errors (typically loss of event alignment) will probably cause
   the death of the system, but not necessarily. Only DSP5 has
   authority to request SCL-init.
   If, at any stage, a wDSP runs into serious trouble, it returns a
   *negative integer*, bypasses the rest of the event processing,
   sends headers & trailers only, and proceeds to the next event.
   DSP5 may be able to recover from that error, but a fatal error in
   DSP5 itself can be really fatal. Such an error will either crash
   the system or, maybe more elegantly, have DSP5 request an SCL-init
   over the whole detector. In case of a crash, the next system
   downstream perceives it, and calls the SCL-init.

As a general rule, every (int) function returning an error code will;
   (1) Perfect processing ==> return 0;
   (2) Fatal error        ==> return -1; or any negative integer,
                              acc. to error code.
                              skip rest of event -> send heads & trails only
   (3) Event errors       ==> return any positive integer, acc. to
                              error code, or number of recoverable
                              errors found.


Section-3  ERROR "FLAVORS"
           ~~~~~~~~~~~~~~~
wDSP Processing Stages are (see l2slic/src/admin/Main.c);

(1) Run Init - A purely "framework" business, DSP configuration based
    on RCP params. Num of active channels, expected mod_id's, set
    pointers to lookup tables, etc...

(2) Event (infinite) loop - here the framework interacts with the
    algorithms. Sub-stages are;

    - GetRawEvent      a framework-only stage, reads in one full event
                       from the external fifo. Checks the data origin
                       (channel number, etc) and the event alignment
                       against the SCL-evt-number. Checks various
                       sources of data counts.
    - PrepareEvent     the framework presents the data, the unpacker
                       zeroes-out and re-fills the target structures
                       (detector sensors) with hits.
    - GlobalAlgorithm  is the track search (hit processing).
    - TransferData     shipping the output thru the serial port.

If any of these (high level function) stages returns a "-1" (or any
negative integer), then all subsequent stages are by-passed, only
headers & trailers are sent to DSP5, who in turn will decide whether
to call SCL-init.


Section-4  WORKER DSP ERROR HANDLING
           ~~~~~~~~~~~~~~~~~~~~~~~~~
For every event, errors are sent as a bit mask in the worker trailer word.
These bits are assigned by the wDSP framework in src/admin/Main.c
according to the returned values of each processing stage;

/*
  There are four processing stages in WorkerIteration();
    GetRawEvent()         - Input
    PrepareEvent()        - Unpack
    (*global_algorithm)() - Process
    TransferData()        - Output
  If any of these stages returns a negative integer, the following
  stages are bypassed, except for the output, which will produce
  only headers and trailers
*/

Thus, errors come in four different "flavors" (Input, Unpack, Process,
Output), each of either fatal or non-fatal nature. Each flavor has a
corresponding bit in the error mask. A fatal error (bypass rest of
event) sets two bits, flavor+fatal. The error mask (trailer word) is
built in l2slic/admin/Stats.h, and is shown in
http://d0server1.fnal.gov/users/maciel/l2docs/formats/wDSP_output.ppt(.ps)

If a wDSP sets the "fatal" bit, this effectively means that the event
is truncated; track stubs may exist that are not being reported.
Fatality in either the Input or the Unpack stage implies zero stubs
found on this DSP, since processing is skipped. Processing, however,
may find/report some stubs before hitting a fatal error and bailing
out.

An output error is never fatal, and only shows up at the next event.
It is there for histogramming purposes only. A severe output error
that results in header absence or corruption of event number will be
detected by DSP5. TransferData is always called for output; it sends
only headers and trailers in case of a fatal error.


Section-5  WORKER DSP ERROR RULES
           ~~~~~~~~~~~~~~~~~~~~~~
These are the detailed rules for the wDSP algorithm developers, for
the integer returned values of the wDSP processing stages;

(1) "Input" (jbk+am) - is a framework job

(2) "Unpackers"
    The unpackers normally do

    (i) a cross check on data origin; this is done per-channel, based
        on the channel headers that get passed by BlocDesc, the data
        description provided by the framework. A failure here is an
        internal consistency error, because the framework itself has
        configured the system.
        If the data has really come from a wrong cable, GetRawEvent
        will already have noticed it beforehand.
        These data origin errors must ==>  *** return -1; ***
        obs: the framework will then SetUnpackErrorBit();
                                     SetFatalErrorBit();

   (ii) a cross check on every hit. Unpackers will return the number
        of bad hits, i.e;
               if(bad hit) nerr++;
               *** return nerr; ***
        What is a bad hit? Anything unexpected (per hit): hit address
        off array limits (e.g. incompatible with detector geometry),
        unphysical hit value, etc...
        The bad hit is simply dropped by the roadside, and unpacking
        continues normally.
        These data corruption errors must ==>  *** return #errs; ***
        obs: the framework will then SetUnpackErrorBit();

(3) "Algorithms"
    Any recoverable imperfection in the running of a worker algorithm
    returns a positive integer (e.g. nerr++;)
        obs: the framework will then SetProcessErrorBit();
    Upon any truncation, ==>  *** return -1; ***
        obs: the framework will then SetProcessErrorBit();
                                     SetFatalErrorBit();
    Examples; Execution is interrupted (truncated) because of time
    out, too many stubs, etc. DSP runs into /0 or nonsense variable,
    etc...

(4) "Output" (jbk+am)


Section-6  WORKER DSP MONITORING
           ~~~~~~~~~~~~~~~~~~~~~
As part of monitoring, the framework measures the duration (in
microseconds) of each of the processing stages above, for every event.
Such times, as well as buffer occupancies in the "GetRawEvent" stage,
are reported as header words for every event. Times and occupancies
are accumulated as histograms by DSP5, who will send this monitoring
info out upon a system's request.

Currently, the 2nd header word is destined to carry the worker
monitoring. The raw information is "bin number" (minimizing DSP output
is a must for time-saving). Each item is monitored by a 16-bin
histogram, with the bin number stored in a "nibble" (4 bits) of the
header word.
The header word can carry up to eight histos, and those currently
planned, and coded in, are tabled in
http://d0server1.fnal.gov/users/maciel/l2docs/formats/DSP_output.ppt (.ps)


wDSP MONITORING HISTOGRAMS (to be Accumulated by DSP5)
~~~~~~~~~~~~~~~~~~~~~~~~~~
   units for time are microseconds,
   units for size are words (4-byte words)
   the highest bin also counts the overflow
   histo bin widths are uniform

Times
histogram name       |n.of bins|n.of bits| range  | resolution
---------------------|---------|---------|--------|-----------
input                |   16    |    4    | [0,15] |   1us
unpack               |   16    |    4    | [0,15] |   1us
process              |   16    |    4    | [0,15] |   4us
output               |   16    |    4    | [0,15] |   1us
---------------------|---------|---------|--------|-----------

Sizes
histogram name       |n.of bins|n.of bits| range  | resolution
---------------------|---------|---------|--------|-----------
input event tot.size |   16    |    4    | [0,127]|   8ints
rogue channel        |   16    |    4    | [0,15] |   chan#
quiet channel        |   16    |    4    | [0,15] |   chan#
idle for now         |         |         |        |
---------------------|---------|---------|--------|-----------

quiet channel counts++ when channel data size = 0 (only Hs & Ts)
rogue channel counts++ when channel data size > some threshold
If two or more channels happen to be rogue (or quiet) in the same
event, only the highest one (last entry) gets reported. Others are
overwritten.
Tentative threshold for "rogue channel" entry is 64 ints,
i.e. rogue = above the half-max histo(ev.size) value.

--------------------------------------------------------------------------
Note; DSP5 and SLIC error handling and reporting (and monitoring) are
      dealt with in a separate document
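
--------------------------------------------------------------------------
Illustration (not the actual l2slic code). The sketch below is one
possible reading of the rules in Section-4 and Section-6: how stage
return values could be folded into the trailer error mask, and how a
bin number could be stored in a nibble of the monitoring header word.
All names (ERR_*, fold_stage, build_error_mask, to_bin, pack_nibble),
the bit positions and the nibble slot numbers are assumptions made for
this example only; the real layout is whatever l2slic/admin/Stats.h
and the format tables quoted above define. The Output flavor is left
out, since an output error only shows up at the next event.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
enum {
    ERR_INPUT   = 1u << 0,
    ERR_UNPACK  = 1u << 1,
    ERR_PROCESS = 1u << 2,
    ERR_FATAL   = 1u << 3
};

/* Fold one stage return value into the mask.
   rc == 0 : perfect processing, nothing is set
   rc >  0 : event error, sets the flavor bit
   rc <  0 : fatal error, sets flavor + fatal bits
   Returns 1 when the rest of the event must be bypassed. */
static int fold_stage(int rc, unsigned flavor_bit, unsigned *mask)
{
    if (rc == 0)
        return 0;
    *mask |= flavor_bit;
    if (rc < 0) {
        *mask |= ERR_FATAL;
        return 1;
    }
    return 0;
}

/* rc[0..2] are the return values of Input, Unpack and Process, in
   that order. Values after the first negative one are ignored,
   because those stages would not have been run. */
static unsigned build_error_mask(const int rc[3])
{
    static const unsigned flavor[3] = { ERR_INPUT, ERR_UNPACK, ERR_PROCESS };
    unsigned mask = 0;
    int i;

    for (i = 0; i < 3; i++) {
        if (fold_stage(rc[i], flavor[i], &mask))
            break;
    }
    return mask;
}

/* Convert a measured value into a 4-bit bin number for a uniform
   16-bin histo; the highest bin also counts the overflow. */
static unsigned to_bin(unsigned value, unsigned bin_width)
{
    unsigned bin = value / bin_width;
    return bin > 15u ? 15u : bin;
}

/* Store a bin number in nibble 'slot' (0..7) of a 32-bit header
   word, leaving the other seven nibbles untouched. */
static uint32_t pack_nibble(uint32_t word, unsigned slot, unsigned bin)
{
    unsigned shift = 4u * slot;

    assert(slot < 8u);
    word &= ~((uint32_t)0xFu << shift);
    return word | ((uint32_t)(bin & 0xFu) << shift);
}
```

With these assumptions, an unpacker returning 3 (three bad hits) sets
only the unpack bit, while an unpacker returning -1 sets unpack+fatal
and the process return value is never looked at. An input event of 70
ints lands in bin 8 of the event size histo (8 ints per bin), and any
size of 120 ints or more lands in bin 15.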