Summary and Conclusions
James T. Linnemann
Summary of new points raised during the workshop:
Triggering and Online Operations
- really missed PLAYBACK for debugging
- trigger overlap measuring tools not known/used online
- during shakedown, basic things useful
- getting CALIB to work
- L2 tool with programmable WAIT time
- L2 tool with TRUE or FALSE settable return
- L2 tool with random return
- anything requiring coordination across nodes is quite a bit of work
- prescaling important, tricky: random seed(node number)
to sample whole run
- statistics across nodes also tricky: processing times
- tagging events with names of trigger bits fired (done
in data logger)
- error message suppression after n of a given type done per node;
hard to sum over system
- record keeping is vital but
- databases must actually get filled, and have extraction methods
- documentation of changes for releases must be summarized for users
- what is the bug
- what does it affect
- when did it begin and end (run and/or release numbers)
- was it online, simulator, or both
- changing definition and location of trigger bits a nuisance
- single source for trigger downloads and user description
- allow error message control w/o full release
- hard to study crashes from text description: need database
- "mark and pass" vital:
record trigger info and write event whether passed or not
- for trigger studies (set to 100%)
- for monitoring (few %)
- global information flow monitoring weak: program an orphan
- Run 1b Level 2 display a big success: color-coded status of all L2 nodes
- needed in-node histograms of threshold curves
- special runs can saturate bandwidth: dynamic prescaling would be useful
- Run II will need faster begin/end run and download (shorter stores)
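
The prescaling and "mark and pass" points above can be illustrated with a minimal sketch. It assumes each node runs its own random prescaler seeded from the run and node numbers, so that accepted events are spread over the whole run rather than taken in lockstep on every node. The names NodePrescaler and mark_and_pass are hypothetical and are not taken from the actual L2 code.

```python
import random


class NodePrescaler:
    """Random prescaler for a single node (hypothetical helper)."""

    def __init__(self, node_number, prescale, run_number=0):
        if prescale < 1:
            raise ValueError("prescale must be >= 1")
        self.prescale = prescale
        # Assumption: the seed depends on the node number, so different
        # nodes follow different random sequences within the same run.
        self.rng = random.Random(f"{run_number}-{node_number}")

    def accept(self):
        """Accept on average one event in `prescale`."""
        return self.rng.random() < 1.0 / self.prescale


def mark_and_pass(filter_passed, mark_fraction, rng):
    """Decide whether to write an event.

    Returns (write_event, marked). A marked event is written whether or
    not the filter passed it; the filter result is still recorded by the
    caller. mark_fraction = 1.0 corresponds to trigger studies, a few
    percent to monitoring.
    """
    marked = rng.random() < mark_fraction
    return (filter_passed or marked), marked
```

Seeding from both run and node number keeps each node reproducible for debugging while avoiding identical accept patterns across nodes.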
Simulation
- Separate executables for L1/L2/L3 simulation
- attractive for code management
- L3 still depends on L1, L2 to be available and correct
- more version numbers to know
- need framework to run without writing large data files
- need terse summary with ONLY efficiencies for L2 bits
- full simulation of trigger requires lots of knowledge to
operate correctly
- frustrating to wait for L1 fixes to use L2SIM
- often changes in triggers prove to make no difference to MC data
- need to organize and cross reference documentation better: WEB
- people prefer to have humans tell them what to do, rather than read
documentation themselves
- clumsy to do studies of rate vs threshold curves
- needed a filter setup for each point on the curve
- could have used ability to generate ntuple of cut quantities w/o coding
- parameterized trigger efficiencies in fast MC often enough
- failed to train L2 reps to solve simple problems
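
The point about rate vs threshold curves can be sketched as follows. The sketch assumes the cut quantity (for example a transverse energy) has already been written out once per event, as in an ntuple of cut quantities, so the whole curve comes from one pass instead of one filter setup per point. The function name pass_fraction_curve is hypothetical.

```python
import bisect


def pass_fraction_curve(values, thresholds):
    """Fraction of events with value >= threshold, for each threshold.

    `values` holds one cut quantity per event; `thresholds` are the
    points at which the curve is evaluated.
    """
    ordered = sorted(values)
    n = len(ordered)
    curve = []
    for threshold in thresholds:
        # Events at or above the threshold sit to the right of this index.
        n_pass = n - bisect.bisect_left(ordered, threshold)
        curve.append((threshold, n_pass / n if n else 0.0))
    return curve
```

Multiplying each fraction by the input rate of the sample gives the rate at that threshold.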
Verification and Monitoring
- trouble with baseline technique for new tools/functionality
- online trigger histograms
- hard for shifters to understand
- too dependent on trigger lists: resists automation
- graphical trigger rates vs luminosity very useful
- statistical tests alone insufficient
- trigger "cross sections" depended on luminosity
- beam conditions cause scatter at a fixed luminosity
- should display expected range when trigger changes
- run summary was not designed into system
- hard to modify
- had to parse a text file to extract numbers to an ntuple
- need to push harder to eliminate "known" problems--they distract
- online verification catches few problems by its nature, but is
- still essential for early detection of hardware failures
- almost only way to see history-dependent software problems
- verification will get harder as simulator diverges from online code,
hardware platform, or operating system
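
The idea of displaying an expected range for a luminosity-dependent trigger "cross section" can be sketched as below. The linear model, the relative tolerance, and the name check_trigger_rate are all assumptions made for illustration; the tolerance stands in for the scatter caused by beam conditions at fixed luminosity.

```python
def check_trigger_rate(rate, luminosity, expected_sigma, rel_tol):
    """Compare a measured trigger cross section with its expected range.

    rate           : measured trigger rate
    luminosity     : instantaneous luminosity (same unit system as rate)
    expected_sigma : function giving the expected cross section at a
                     given luminosity (assumed model)
    rel_tol        : allowed relative scatter around the expectation

    Returns (sigma, low, high, ok).
    """
    sigma = rate / luminosity
    center = expected_sigma(luminosity)
    low = center * (1.0 - rel_tol)
    high = center * (1.0 + rel_tol)
    return sigma, low, high, low <= sigma <= high


def linear_model(luminosity, sigma0=2.0, slope=0.5):
    """Hypothetical expectation: cross section rising linearly with luminosity."""
    return sigma0 + slope * luminosity
```

When a trigger definition changes, only the model needs updating; the display can then show the new band immediately.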
Code Development
- half of L2 releases to fix bugs; half upgrades
- large number of problems found by inspection of code suggests CODE REVIEWs
- small fraction of problems were found online via EDEBUG
- EDEBUG's clumsiness lowered this fraction
- did this justify overhead of running in DEBUG mode always?
- should have enforced "pass release only from CMS" rule more strictly
- too much code was released before adequate unit testing
- checklist of unit tests performed?
- MC does a poor job of predicting problems using real data (eg formatting
errors)
- error message formatting code often crashed when exercised online (bad data)
- memory limits in non-paging environment made reuse of offline painful
- strategy for sharing: L3 is 1st step of offline code
- nuisances from freezing whole base libraries, not L2 subset
- 40 libraries, 5000 routines in base release
- 3 months to change set of base libraries and recertify
- but work to specify the subsets
- and upgrades using new utility routines might be harder
- don't release library .OLB's if only production area is useable
- documentation of changes where?
- routine headers
- cms insertion comments
- separate release notes
- all of above?
- separate framework .EXE from physics filter .EXE
- faster framework/monitoring turnaround
- compilation on multiple platforms should be part of release
- Now is the time to interact with hardware designers on data
format optimization (eg ability to do binary search for desired data)
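
As an illustration of a data format that permits a binary search for desired data, the sketch below assumes a purely hypothetical event layout: a count word, then a directory of fixed-size entries (bank id, offset, length) sorted by bank id, then the bank payloads. None of this is the actual hardware format.

```python
import struct

# Hypothetical directory entry: bank id, byte offset, byte length.
ENTRY = struct.Struct("<III")
COUNT = struct.Struct("<I")


def build_event(banks):
    """Pack {bank_id: bytes} into one buffer with a sorted directory."""
    ids = sorted(banks)
    data_start = COUNT.size + ENTRY.size * len(ids)
    directory = b""
    payload = b""
    for bank_id in ids:
        directory += ENTRY.pack(bank_id, data_start + len(payload),
                                len(banks[bank_id]))
        payload += banks[bank_id]
    return COUNT.pack(len(ids)) + directory + payload


def find_bank(event, wanted_id):
    """Binary search the sorted directory; return the bank bytes or None."""
    (n_banks,) = COUNT.unpack_from(event, 0)
    lo, hi = 0, n_banks
    while lo < hi:
        mid = (lo + hi) // 2
        bank_id, offset, length = ENTRY.unpack_from(
            event, COUNT.size + mid * ENTRY.size)
        if bank_id == wanted_id:
            return event[offset:offset + length]
        if bank_id < wanted_id:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Fixed-size, sorted directory entries are what make the search possible without unpacking every bank; a format with variable-length headers scattered through the data would force a linear scan.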