Overview of L3 Algorithms Section for Operations and Upgrade Document
---------------------------------------------------------------------

--Introduction and Overview of L3 Section

An L2 accept causes full readout of the event to take place. The single
board computers in each front-end readout crate send their data to one of
the ~100 L3 farm machines. The L3 system performs two functions on each
event:

- Event Building: the complete event raw data chunk is built from the data
  received from the front-end readout crates.
- Event Filtering: guided by L1/L2 trigger information:
    * Perform partial unpacking/reconstruction of raw data using fast
      algorithms.
    * Select which events should be recorded.
    * Select to which (exclusive) stream each recorded event should be
      sent.

For the purposes of monitoring the performance of the L3 trigger, the
following additional actions are performed:
    * The results of the L3 event reconstruction are added to the event
      data structure for each recorded event.
    * On a small fraction of randomly chosen "Mark and Pass" events:
        + the events are recorded irrespective of the L3 filter decision.
        + extra "debug" information is added to the event data structure.
    * Statistics are collected online on CPU time consumption for each
      tool and the pass rates for each L3 trigger.

--Very Brief Overview of the Run 2a System:

- The boundary conditions (input/output rates and event sizes) under which
  the system is designed to operate are as follows:
    Input:  1 kHz at 300 kByte/event.
    Output: around 50 Hz average; system must be able to deal with
            ~80 Hz peak?
- The L3 farm comprises 100 * 1 GHz CPUs running Linux. About 15 ms/event
  are needed for input/event building/output. Since the input rate to L3
  is 1 kHz, each of the 100 CPUs has on average 100 ms per event, which
  leaves about 75-85 ms/event for unpacking, reconstruction and filtering.
  (It is probably safe to assume, on grounds of stability and efficiency
  of operations, that we do not want to try to run the system at the very
  limit of its resources.)
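The timing budget quoted above can be checked with a few lines of
arithmetic. The sketch below is illustrative only: it assumes that events
are spread evenly over the farm, that each CPU handles one event at a time,
and that 1 kByte = 1000 bytes. The function and constant names are invented
for this example.

```python
# Run 2a boundary conditions as quoted in the text.
INPUT_RATE_HZ = 1000       # L3 input rate
OUTPUT_RATE_HZ = 50        # average L3 output rate
EVENT_SIZE_KBYTE = 300     # average event size
N_CPUS = 100               # number of farm CPUs
IO_OVERHEAD_MS = 15        # input / event building / output per event


def per_event_budget_ms(input_rate_hz, n_cpus):
    """Average wall-clock time one CPU has per event (assumes even load)."""
    return 1000.0 * n_cpus / input_rate_hz


def filter_budget_ms(input_rate_hz, n_cpus, io_overhead_ms):
    """Time left for unpacking, reconstruction and filtering."""
    return per_event_budget_ms(input_rate_hz, n_cpus) - io_overhead_ms


def bandwidth_mbyte_per_s(rate_hz, event_size_kbyte):
    """Data rate in MByte/s (assumes 1 MByte = 1000 kByte)."""
    return rate_hz * event_size_kbyte / 1000.0


total = per_event_budget_ms(INPUT_RATE_HZ, N_CPUS)
left = filter_budget_ms(INPUT_RATE_HZ, N_CPUS, IO_OVERHEAD_MS)
print(f"per-event budget per CPU : {total:.0f} ms")
print(f"left after I/O overhead  : {left:.0f} ms")
print(f"input bandwidth          : "
      f"{bandwidth_mbyte_per_s(INPUT_RATE_HZ, EVENT_SIZE_KBYTE):.0f} MByte/s")
print(f"output bandwidth         : "
      f"{bandwidth_mbyte_per_s(OUTPUT_RATE_HZ, EVENT_SIZE_KBYTE):.0f} MByte/s")
```

The 85 ms that comes out is the upper end of the 75-85 ms range in the
text; the lower end corresponds to keeping some headroom rather than
running the farm at the limit of its resources.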
- Scriptrunner

  Scriptrunner is the program that controls the running of the L3 software
  and determines the L3 trigger decision on each event. In order to save
  processing time only a partial reconstruction of each event, depending
  on the L1/L2 trigger information, is performed in L3.

  Each L2 bit that fires causes one or more L3 _filter_scripts_ to be run.
  If any filter script returns .true. the event is flagged to be recorded.
  Each filter script consists of the logical .AND. of one or more
  _filters_. Each filter requires the presence of one or more physics
  objects satisfying given criteria. These physics objects are produced by
  L3 _tools_ that are called by the filter. Tools may themselves call
  other lower level tools to provide the input data they need. For
  example, the electron tool calls the calorimeter cluster-finding tool,
  which itself calls the calorimeter unpacking tool.

  The _trigger_list_ allows flexible definition of:
    * which L3 filter scripts should be called on each L1/L2 trigger bit
    * which filters make up each filter script
    * which tools are called by each filter
    * the variable parameters of each tool and filter (e.g., p_t cuts,
      cone sizes, etc.)

  A number of other features of the way L3 operates are designed to save
  processing time:
    * When each tool is run its results are saved in case this tool is
      called again in the same event by another tool or filter.
    * When a particular filter returns .false., any subsequent filters in
      the given script are not run (since the script will, anyway, return
      .false.).
      [Terry's note: for precision physics I still worry about this.]
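The control flow described above (scripts as the .AND. of filters, tools
calling lower-level tools, cached tool results, and early exit on the
first failing filter) can be sketched as follows. This is a toy model, not
the Scriptrunner code: the class and function names, the script names
EM_HI and EM_JET, the L2 bit number and the "physics" in the tools are all
invented for illustration.

```python
class ToolRunner:
    """Runs tools on one event and caches each result for reuse."""

    def __init__(self, tools, raw):
        self.tools = tools   # tool name -> function(ToolRunner)
        self.raw = raw       # toy stand-in for the raw event data
        self.cache = {}      # tool name -> saved result for this event
        self.calls = []      # order in which tools actually ran

    def get(self, name):
        if name not in self.cache:
            self.calls.append(name)
            self.cache[name] = self.tools[name](self)
        return self.cache[name]


# Toy tools: each may call lower-level tools through run.get().
def cal_unpack(run):
    return list(run.raw["cal_cells"])

def cal_cluster(run):
    return [et for et in run.get("cal_unpack") if et > 1.0]

def electron_tool(run):
    return run.get("cal_cluster")

def jet_tool(run):
    return run.get("cal_cluster")

# Toy filters: require at least one object above a threshold.
def electron_filter(run, min_et):
    return any(et >= min_et for et in run.get("electron"))

def jet_filter(run, min_et):
    return any(et >= min_et for et in run.get("jet"))


TOOLS = {"cal_unpack": cal_unpack, "cal_cluster": cal_cluster,
         "electron": electron_tool, "jet": jet_tool}

# Hypothetical trigger list: scripts per L2 bit, filters per script.
SCRIPTS = {
    "EM_HI": [(electron_filter, {"min_et": 20.0})],
    "EM_JET": [(electron_filter, {"min_et": 30.0}),
               (jet_filter, {"min_et": 15.0})],
}
TRIGGER_LIST = {3: ["EM_HI", "EM_JET"]}


def run_script(run, script):
    # all() stops at the first filter that returns False.
    return all(flt(run, **params) for flt, params in script)


def l3_decision(raw, l2_bits):
    run = ToolRunner(TOOLS, raw)
    passed = []
    for bit in l2_bits:
        for name in TRIGGER_LIST.get(bit, []):
            if name not in passed and run_script(run, SCRIPTS[name]):
                passed.append(name)
    # The event is recorded if any script passed.
    return bool(passed), passed, run.calls


accept, passed, calls = l3_decision({"cal_cells": [25.0, 12.0, 0.4]}, [3])
print(accept, passed, calls)
```

In this toy event the electron filter of EM_JET fails, so its jet filter
(and hence the jet tool) never runs, and the calorimeter tools run only
once even though two scripts use them.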
- Algorithm software

  Currently running online in L3 we have the following tools/filters
  ["->" can be read as "which calls"]:
    * calorimeter cluster tool -> calorimeter unpack
    * jet filter -> jet tool -> calorimeter cluster tool
    * electron filter -> electron tool -> calorimeter cluster tool
    * tau filter -> tau tool -> calorimeter cluster tool
    * muon filter -> local muon tool -> muon unpacking tool
    * global track filter -> global tracking -> smt and cft unpacking

  A lot of effort has been devoted recently to getting "offline" quality
  treatment of the raw data in the unpacker tools. For example:
    * All unpackers other than calorimeter are fully dynamic (i.e., they
      determine the readout configuration from the data themselves).
    * Channel-by-channel treatment of thresholds and treatment of noisy
      channels are performed for the tracking detectors.
    * Close to offline-quality geometry is used for the tracking
      detectors.
    * Channel-by-channel treatment of calorimeter non-linear corrections
      and gains is performed.
    * Dynamic killing of hot cells in the calorimeter is performed.

  In order to improve the E_t resolution in the calorimeter we are hoping
  to have certified for online use in the next few weeks a tracking-based
  tool to find the z coordinate of the primary vertex.

  Many other tools and filters will become available online on a somewhat
  longer timescale. These include:
    * hit-based primary vertex tool
    * missing E_t tool
    * cft-only tracking tool
    * cps and fps cluster finding and unpacking tools
    * tools to associate objects in different detectors (e.g. track to
      muon)
    * tool to provide b-tagging by impact parameters and displaced
      secondary vertices
    * tools to calculate "physics" quantities (e.g., invariant mass,
      delta_eta)
    * tools to identify physics event types (e.g., W, Z, stream
      definitions)

- Expected evolution from where we are now to run 2a design luminosity

    * Currently:
        + Tevatron lumi ~ 2 * 10^31 cm^-2 s^-1 (which is a factor of ~10
          below run 2a design)
        + L1 is currently limited to ~100 Hz (by DAQ instability and the
          absence of rejection at L2). A consequence of this is that a
          rejection factor at L3 of ~5 is adequate.
        + Calorimeter trigger instrumented only to |eta|<0.8
        + No L1 track trigger and tracking detector readout incomplete
    * Steady evolution envisaged as luminosity increases, with the most
      likely discontinuities coming from DAQ and L1/L2 trigger changes:
        + As L2 slowly turns on (which is in the process of happening now)
          more discrimination will be needed in L3 to maintain factor 5
          rejection (particularly in lepton filters).
        + Improvement in L3 DAQ rate (which is expected to deliver ~500 Hz
          input rate to L3 by end May) similarly allows L1 prescales to be
          reduced and requires greater discrimination from L3.

  N.B. It is very difficult at the moment to say whether or not we have
  adequate CPU power in the L3 farm (given the low luminosity and the very
  incomplete nature of the detector, L1/L2 trigger systems, DAQ system,
  and the fact that we are currently running only a small sub-set of the
  finally envisaged L3 tools, filters and monitoring). The hope is that we
  shall have a much better measurement of our CPU needs by mid-June; by
  then we expect to have experience of running at higher luminosity and
  DAQ rates and with a much more complete trigger list. However, a
  reasonable guess might be that an increase by roughly a factor of two in
  the CPU resources of the L3 farm will be needed to give the required
  performance at design luminosity.
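  To make "greater discrimination" concrete, the rejection factor is
  simply the L3 input rate divided by the L3 output rate. The sketch
  below assumes that the ~50 Hz average output rate quoted in the Run 2a
  overview stays fixed as the input rate grows; that is an assumption made
  for this illustration, not a statement of the plan.

```python
def output_rate_hz(input_rate_hz, rejection):
    """L3 output rate for a given input rate and rejection factor."""
    return input_rate_hz / rejection


def required_rejection(input_rate_hz, target_output_hz):
    """Rejection factor needed to hold the output at the target rate."""
    return input_rate_hz / target_output_hz


TARGET_OUTPUT_HZ = 50.0  # assumed fixed average output rate

# Today: ~100 Hz into L3 with a rejection of ~5.
print(f"now: 100 Hz / 5 -> {output_rate_hz(100.0, 5.0):.0f} Hz out")

# Later stages, holding the output at the assumed 50 Hz.
for label, rate in (("~500 Hz DAQ", 500.0), ("run 2a design", 1000.0)):
    factor = required_rejection(rate, TARGET_OUTPUT_HZ)
    print(f"{label}: {rate:.0f} Hz in -> rejection of {factor:.0f} needed")
```

  Under this assumption the design input rate of 1 kHz needs a rejection
  of about 20, i.e. roughly 95% of the events reaching L3 are rejected.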
- Standard certification and verification requirements

  see: http://www-d0.fnal.gov/computing/algorithms/level3/meetings/talks/certification.html

  Technical issues associated with tools/filters, which are currently open
  questions:

    * How should L3 filter scripts be implemented in cases such as
      electron and muon filters, where there is a lot of redundancy in our
      ability to trigger? Should we have many filter scripts hanging off
      the same L1/L2 bit? Or should we have a single filter script that
      calls a tool to give the .or. of several independent selections (and
      stores detailed information on how the trigger decision was arrived
      at in its L3PhysicsResults block)? The former solution might lead to
      an explosion in the number of L3 triggers needed. Remember that we
      shall be hoping to have some parallelism at L1/L2 in our electron
      and muon triggers. See also the discussion on tools for physics
      analysis below.

    * Number of L3 trigger bits
      The L3 system has been designed in such a way that the number of L3
      trigger names could easily be increased beyond the currently
      implemented 256. All that would be needed would be for the number of
      words in the itc_header reserved for this purpose to be increased.
      The L3 group was working under the assumption that this flexibility
      was required and that no decision had been taken to fix the design
      to a maximum of 256. However, it appears that in several places
      "downstream" of L3 the number 256 has been cast in stone (collector,
      datalogger, distributor, sdaq, monitoring and recovery, event
      catalog?).

    * Event identification (e.g., W, Z) tools.
      These would need to be called whenever an L3 filter script designed
      to pick up high-p_t isolated leptons passes an event. One possible
      way of implementing these would be for a W/Z filter to be added at
      the end of every such L3 filter script. The only purpose of the W/Z
      filter would be to call the W/Z tool.
      Since the results of the W/Z filter should not affect whether or not
      the event is recorded, the W/Z filter would always return .true.
      W/Z physics objects would then be available in the L3 data for the
      purposes of streaming and monitoring.

--Items where (much) more work is needed:

- Monitoring/Quality control/Routine verification of new releases

  - What currently exists:
    - L3 monitor statistics online for each run
      For each filter script and each filter within that script, numbers
      of calls and passes are available to the shift crew and archived.
      Information on timing and memory usage is available online but some
      effort needs to be found to get this displayed in the control room.
    - l3fanalyze offline
        * Program exists to read "physics_results" and "debug_info" added
          by L3 to the event data structure and fill them into a rootuple.
          Each tool author is required to provide the necessary code for
          their tool.
        * Individual tool/filter authors have (at the moment private)
          code/macros to produce histograms, study performance, etc., from
          the l3fanalyze rootuple.

  - Aims:
      * A standard set of checks that can be run with each new release of
        the L3 filter code.
      * Systematic monitoring on a run-by-run basis of the performance of
        the L3 tools and filters that are running online.

  - Work in progress:
      * Define set of standard test samples.
      * Central job submission with standard trigger list to produce
        rootuples.
      * Standard macros for reference plots.
      * Implement shadow nodes: these allow a subset of events to be sent,
        in parallel to the normal L3 farm, to a test machine that can run
        development code/calibrations/trigger list.

  - Tool needed:
    A tool is needed to provide something approaching a "bit-wise"
    comparison between the L3 chunk produced online and that produced by
    running the simulator offline.
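    As a starting point for discussion, a word-by-word comparison of two
    chunks could look like the sketch below. It is a minimal illustration
    and makes several assumptions: both chunks are available as raw byte
    strings, a 4-byte word is the natural unit, and exact equality is the
    right test (floating-point quantities may in practice need a
    tolerance, which is presumably why only "something approaching" a
    bit-wise check is asked for). The function names are invented.

```python
from itertools import zip_longest


def split_words(chunk, word_size=4):
    """Cut a byte string into consecutive words of word_size bytes."""
    return [chunk[i:i + word_size] for i in range(0, len(chunk), word_size)]


def compare_chunks(online, offline, word_size=4, max_report=10):
    """Return up to max_report differences as (word index, online, offline).

    A word missing from the shorter chunk is reported as None.
    An empty list means the two chunks are identical.
    """
    diffs = []
    pairs = zip_longest(split_words(online, word_size),
                        split_words(offline, word_size))
    for index, (on_word, off_word) in enumerate(pairs):
        if on_word != off_word:
            diffs.append((index, on_word, off_word))
            if len(diffs) >= max_report:
                break
    return diffs


# Toy data: the second word differs in its last byte.
online_chunk = bytes(range(12))
offline_chunk = bytes([0, 1, 2, 3, 4, 5, 6, 99, 8, 9, 10, 11])

for index, on_word, off_word in compare_chunks(online_chunk, offline_chunk):
    print(f"word {index}: online={on_word.hex()} offline={off_word.hex()}")
```

    Reporting the word index rather than just a pass/fail flag is meant to
    help a tool author locate which part of the chunk disagrees.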
  - Making monitoring a routine control room activity for shift crew
      * Run l3fanalyze online as an "examine" to produce rootuple
      * Employ root macros to read rootuple and display standard set of
        monitoring histograms for current run + compare with reference
        histograms
    Note: There is a significant amount of overlap here with what is
    needed to monitor other parts of the trigger and online system,
    particularly with L2. We are trying as much as possible to pool our
    limited manpower in areas of common interest.

  - Can we do more sophisticated online monitoring in the L3 nodes?
    (L3 sees data at 1 kHz and does a pretty complete reconstruction of
    these data.)
      * For example, collect histograms, measure efficiencies
      * Make use of the 95% of the events that we reject? For example:
          + Measure trigger turn-on curves (for L1 and L2 as well as L3)
          + Do background studies
        (Why write out events and have the huge overhead in having to run
        offline reconstruction and storing them permanently if they are
        needed for relatively simple operations that can be performed
        adequately in L3? How about writing a stream with L3 reco
        information but no raw data, e.g. QCD low Et jet data? The
        requirement that 17 different jet algorithms be run might make
        this a non-starter.)
      * Best way to concatenate results from monitor processes running on
        each of the 100 L3 farm nodes not worked out yet.
      * Will require extra resources at L3, but the potential return (in
        terms of spotting trigger problems and in saving offline
        resources) might make this a very cost-effective investment. This
        might also be the case if we find that lack of CPU power is
        limiting the sophistication of the event reconstruction and/or
        filtering that is possible in L3.

  - Does L3 need a dedicated offline farm for testing/monitoring?
    (These nodes could be used by the standard offline farm when they are
    not needed for dedicated L3 use.)

- Calibrations/Geometry issues for L3:

  - There is a lot of work to be done here, which has barely started.
  - As described above, we have taken the approach of implementing
    "offline" quality unpacking of the raw data for L3. Our studies showed
    that we needed this in order to achieve adequate performance. This
    does have some consequences for complexity and execution time.
  - In which format should calibrations/geometry data be input to L3? (At
    the moment each L3 subsystem handles this differently, if at all.)
  - How do we make sure the correct calibration/alignment is downloaded
    online?
  - Mechanism for download and keeping track of which
    calibration/alignment versions were used online for which runs.
  - At the moment we have a number of flat files containing this
    information for different parts of the detector, which have to be
    distributed to the L3 nodes. One approach that we might adopt is to
    have a master L3 configuration (flat) file that would contain the
    names and version numbers of all of the other calibration/geometry
    files. It could also contain information such as the node to which L3
    monitor information should be sent, which currently resides (somewhat
    inappropriately) in the trigger list. Such information cannot be put
    into an RCP file, because it cannot be tied to a particular release.
    The master L3 configuration file would be logged in the runs database.
  - Many of these topics are issues for other parts of the trigger and for
    individual subdetector groups. Common solutions are clearly desirable
    and some discussions in the D0 online group as a whole have started.

- L3 interaction with the trigger database

  - Trigger database allows:
      + L3 tools, filters and filter scripts to be defined along with
        their status (current, future, local?) and their input parameters
        (names, types, defaults, allowed ranges).
      + Triggers to be defined consisting of L1/L2/L3 terms and the
        relevant parameters.
      + Trigger lists to be defined (consisting of lists of triggers).
      + Easy user access to official trigger lists and detailed
        information on specific triggers.
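    The idea of declaring each input parameter with a name, type, default
    and allowed range can be illustrated with a short sketch. This is not
    the trigger database schema: the class ParamSpec, the function
    resolve_params, and the example parameters and ranges are all
    hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class ParamSpec:
    """Declaration of one tool/filter input parameter (hypothetical)."""
    name: str
    kind: type
    default: Any
    allowed: Optional[Tuple[float, float]] = None  # inclusive (min, max)


def resolve_params(specs, overrides):
    """Merge trigger-list overrides with defaults and validate them."""
    known = {spec.name for spec in specs}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    resolved = {}
    for spec in specs:
        value = overrides.get(spec.name, spec.default)
        if not isinstance(value, spec.kind):
            raise TypeError(f"{spec.name} must be {spec.kind.__name__}")
        if spec.allowed is not None:
            low, high = spec.allowed
            if not low <= value <= high:
                raise ValueError(f"{spec.name}={value} outside [{low}, {high}]")
        resolved[spec.name] = value
    return resolved


# Invented example: parameters of an electron filter.
ELECTRON_FILTER_SPECS = [
    ParamSpec("min_et", float, 15.0, (0.0, 500.0)),
    ParamSpec("cone_size", float, 0.4, (0.1, 1.0)),
]

print(resolve_params(ELECTRON_FILTER_SPECS, {"min_et": 20.0}))
```

    Validating against declared ranges at the time a trigger is defined
    would catch mistyped cuts before a trigger list reaches the farm.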
  - Current limitations:
      + Because the L3 system is not allowed to interact directly with the
        database, it uses flat files (tools.rcp, filters.rcp) that contain
        lists of the available tools and filters and define their status
        and input parameters. This is OK, but currently these files:
          * Cannot be generated automatically from the trigger database.
          * Are tied to a code release. Probably they should not be?
      + Only the triggermeister has write access. In the official area for
        triggers that are actually run online this is probably as it
        should be. However, we really need a users' area for tool
        authors/physics groups to develop and test new tools, triggers and
        lists. This was part of the database specification and the project
        will not be finished until this is provided.

- Streaming infrastructure

  Some of the basic infrastructure for L3 to define more than one output
  stream exists, and it has been tested online that the data logger
  correctly produces more than one output stream. However, full
  implementation of the complete scheme needs more work.

- Monte Carlo simulation

  - Basic L3 simulator is described in the Trigger Simulator chapter.
    (But see next section.)

- Tools for analyzing trigger performance, etc.

  Many of the challenges here are common to the trigger system as a whole.
  The L3 group will need to find a significant amount of manpower to
  contribute ideas and solutions to these challenges, but this will be an
  important issue for D0 as a whole.

  - L3 Experts (e.g., tool authors)
    In order to verify the online filter results by running the simulator
    on real data one has to take into account the fact that simultaneous
    multiple runs mean that there is not a one-to-one correspondence
    between the trigger list and trigger bits. At the moment this is
    extremely inconvenient, requiring a lot of "by hand" interventions to
    the level.sim file even to analyze a single run. See, e.g.,
    http://www-d0.fnal.gov/computing/algorithms/level3/meetings/talks/yann_270202.ps.gz
    This has to improve soon!

  - Users (Physics Analysis)

    a) Expert Users
       The most demanding physics analyses in terms of understanding
       trigger efficiencies will require a high level of expertise. The
       upgraded D0 detector provides a high degree of redundancy for
       triggering on electrons, muons, taus, etc. The L3 (and other
       levels) triggers will be designed to make maximum use of this
       redundancy to maximise the efficiencies and the accuracy with which
       they can be determined. Understanding all this at the level of
       physics, algorithm and detector performance will be complicated
       enough! The L3 algorithms group will have a responsibility to
       minimise the additional technical hurdles that users have to
       overcome by providing tools to access detailed information about
       the L3 trigger decision.

    - Simplifying life for the average physics analyser
      Many analyses do not require a high precision for trigger
      efficiencies and cannot afford the overhead in understanding the
      fine details of exactly how the trigger works and how it varies with
      time. Physicists doing these analyses need to be provided with
      simple-to-use tools. For example:
        * How to use the "recommended", "best", "simplest", "most robust",
          single electron trigger. This probably needs to encompass
          L1/L2/L3. It will have to handle for the user variations of the
          trigger definitions with time and provide tools to help the user
          make efficiency/background estimates.
        * Tools to associate physics objects at different stages
          L1/L2/L3/RECO/(MC-truth). Some work has started on this.
        * How do we generate Monte Carlo samples that give as accurate as
          possible a luminosity-weighted simulation of actual L3
          performance?

- Thumbnail

  A design for the information that needs to be stored in the L3 part of
  the thumbnail was put together ~2 years ago. This needs to be re-visited
  in the light of recent experience. In common with the rest of the
  trigger, no code to implement the thumbnail has been written.
- Documentation

  An L3 algorithms web site is maintained at:
  http://www-d0.fnal.gov/computing/algorithms/level3/home.html.
  Minutes of meetings with electronic copies of all talks given are kept
  up-to-date. However, much of the documentation of central L3
  infrastructure and individual filters/tools is out of date or missing.
  Urgently needs work.

--Miscellaneous

- Summary of areas where extra manpower needed

  - Central L3 software
    It is in the provision of central L3 software (code management,
    infrastructure, tools) that we have most urgent need for additional
    manpower.

    Topic                                       FTE currently   extra FTE
                                                active          needed
    Scriptrunner + central L3 code
      infrastructure, release management        1.0             1.5
    Streaming                                   -               0.5
    Monitoring/Quality control:
      * quality control macros                  -               0.5
      * migration to online                     -               0.5
      * "bit-wise" on/offline check             -               0.5
    Calibration/alignment technical
      infrastructure                            -               1.0
    Development of "user" and "physics
      analysis" tools                           -               >1.0
    L3 thumbnail                                -               0.5

    Such work clearly qualifies as a "service" contribution to D0. Groups
    that have new students or postdocs might consider steering them in one
    of these directions. Although rather technical, some of these projects
    would be an excellent way to learn about all parts of the D0 detector
    and the identification of the different types of physics objects, and
    would be an excellent preparation for physics analysis. Similarly, new
    groups seeking to join D0 might be asked to make a contribution in
    these areas.

    Code exists and is reasonably well tested to unpack the data for most
    of the individual subdetectors. However, work will be needed for all
    subdetectors to handle the calibration/alignment issues discussed
    above.

    A lot of excellent and productive work is going on in the development
    of individual physics tools and filters. Of course, new people are
    always needed to join this effort, particularly as several authors of
    important code have moved on (to other activities in D0, or have left
    D0).
    A few specific areas where extra help could be used:
      * Testing/developing hit-based primary vertex finder

--Run 2b Upgrades

Difficult to say much with any confidence here!

- The boundary conditions (input/output rates and event sizes)
    Input:  ??? kHz at 500 kByte/event?
    Output: around 200 Hz?
  Certainly depends on the physics aims we are aspiring to for run 2b.

- Farm hardware
  Almost certainly will need a substantial upgrade. But by how much?

- Software
  ???