Lessons from Level 2
James T. Linnemann
Michigan State University
May 22, 1996
Incorporates comments from Jim McKinley, Steve Linn, and Terry Geld
Areas Covered
- Level 2 in practice as a trigger
- Level 2 as a physics simulation
- Level 1, 1.5 as well (good idea?)
- Verification of Level 2
- MONITOR stream and test release verification
- Level 1, 1.5 as well
- Level 2 Code Development
What Do We Want to Learn?
- What was done well
- What wasn't done well
- What wasn't done at all
- What was done that wasn't needed
- What will be new in Run II Level 3
Emphasis
- My personal opinion is that goals were broadly met, with
a lot of hard work
- Hope that discussion sessions will be active, dominated by
users
- I will try to avoid being overly defensive
- My (and others') remarks will probably err towards the overly
critical: apologies
Manpower
- Critical task; need (and had) good people
- More people needed than we had
- "Service" work is harmful to your career
- Is this a problem in D0's culture?
- Do we rotate people among tasks too fast/slow?
- Can Institutional Responsibility help this???
- Things we knew we needed came later, or never, or uglier
than we had planned
Run the Offline in Level 3?
- 5 X speedup just to maintain budget
- more rejection upstream already
- same output rejection is planned
- L3 sees 50 X as many events as RECO farm
- pay 50 X as much, or run less or faster code
- Consequences of different code:
- Effort in certifying algorithms (no test beam?)
- Thresholds rise by 2-3 sigma + Delta(Algorithm) each level
- Plan now to get the calibration/alignment
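One reading of the threshold bullet above is that each level must cut harder than the level below it by a few resolution widths plus any systematic difference between the two algorithms, so that it sits on the lower level's efficiency plateau. A minimal sketch of that arithmetic follows. The function names and all numbers are illustrative assumptions, not values from the talk.

```python
def plateau_threshold(lower_threshold, sigma, n_sigma=2.5, delta_algorithm=0.0):
    """Threshold the next level needs in order to sit on the plateau of the
    level below: lower threshold + n_sigma * resolution + algorithm offset."""
    return lower_threshold + n_sigma * sigma + delta_algorithm


def stack_thresholds(first_threshold, steps):
    """Apply plateau_threshold once per step.

    steps is a list of (sigma, n_sigma, delta_algorithm) tuples, one per
    transition between trigger levels (hypothetical inputs).
    """
    thresholds = [first_threshold]
    for sigma, n_sigma, delta in steps:
        thresholds.append(plateau_threshold(thresholds[-1], sigma, n_sigma, delta))
    return thresholds


# Illustrative numbers only (GeV): a 10 GeV first-level threshold,
# then two steps with different resolutions and algorithm offsets.
example = stack_thresholds(10.0, [(2.0, 2.5, 1.0), (1.0, 2.5, 0.5)])
print(example)
```

With these made-up inputs the usable threshold nearly doubles after two steps. That is the cost of running different code at each level, and sharing algorithms and calibration would shrink the Delta(Algorithm) term.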
Triggering
- Overall frame design seemed adequate
- Shadow mode vital for testing
- Trigparse needed to provide defaults
- Rejection, Et resolution a bit low
- Tracking little used in electrons
- Was changing L2 too hard? or too easy?
- Speed means (almost) fastest first
Triggering/Operations
- GM's could have used more automation and guidance, esp.
for new trigger versions
- Too many error messages for DAQ shifters
- Inadequate Record Keeping for Users:
- Failed to tag all candidates with L2 bit
- Offline data base never loaded
- Master Database of Run, trigger version, L2 version never
materialized (to my knowledge)
- Need faster run start or prescale download?
Hitfinding
- Data Compression is a possible function of L3 if output
bandwidth is a limitation
- L2 Hitfinding was late, and a lot of work, and (as I recall
it) not used very much
- This wasn't the only code we developed but did not use
- Long leadtime often a factor in unused code: forces conservatism
Simulator Goals
- Detailed simulation: verify trigger behavior
- For L2, done by running same code
- never really replicated data layout in nodes
- Physics simulation tool
- could be less detailed, but then more coding
- trigger simulation is not time-limiting step
- L2 has to run 50 X faster than offline
- Estimate trigger rates
- never handled weighted events properly
- needs vast statistics (hard for accurate raw format)
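For the weighted-event point above, a minimal sketch of one standard way to estimate a pass fraction and its statistical error from weighted Monte Carlo events. This is not the method the L2 simulator used; the function name and inputs are assumptions for illustration.

```python
import math


def weighted_pass_fraction(weights, passed):
    """Return (fraction, error) for weighted events.

    weights: per-event weights (assumed non-negative).
    passed:  per-event booleans, True if the trigger accepted the event.
    The error propagates var(P) = sum of w^2 over passing events and
    var(F) = sum of w^2 over failing events through f = P / (P + F).
    """
    pass_sum = fail_sum = pass_w2 = fail_w2 = 0.0
    for w, ok in zip(weights, passed):
        if ok:
            pass_sum += w
            pass_w2 += w * w
        else:
            fail_sum += w
            fail_w2 += w * w
    total = pass_sum + fail_sum
    if total <= 0.0:
        raise ValueError("sum of weights must be positive")
    fraction = pass_sum / total
    error = math.sqrt(pass_w2 * fail_sum ** 2 + fail_w2 * pass_sum ** 2) / total ** 2
    return fraction, error


# With unit weights this reduces to the binomial error sqrt(p(1-p)/N).
frac, err = weighted_pass_fraction([1.0] * 4, [True, False, False, False])
print(frac, err)
```

An output rate estimate is then the input rate times `fraction`. A few high-weight events can dominate the error, which is one reason such estimates need large samples.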
Simulation
- L2 = simulation (same code), so accurate, BUT no alpha version
- Adding L1, L1.5 simulations a LOT of work
- Separate, piped executables?
- Less interaction of simulation Vs. online release needs,
"backfits" (but more versions to track)
- Hard to use
- MC Data not properly self-identifying
- logicals, search lists confusing (RCP's)
- needed more output to debug without adding code
- Standard "which cut failed" counters; histos of
variables
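The "which cut failed" counters mentioned above could look roughly like the sketch below: every candidate is charged to the first cut it fails, so no user code is needed to see where rejection happens. The cut names, candidate fields and thresholds are hypothetical.

```python
from collections import Counter


def first_failed_cut(candidate, cuts):
    """Return the name of the first cut the candidate fails, or None."""
    for name, predicate in cuts:
        if not predicate(candidate):
            return name
    return None


def run_cutflow(candidates, cuts):
    """Count candidates by the first cut they failed (or 'passed')."""
    counts = Counter()
    for candidate in candidates:
        failed = first_failed_cut(candidate, cuts)
        counts[failed if failed is not None else "passed"] += 1
    return counts


# Hypothetical electron-like cuts applied in order, cheapest first.
cuts = [
    ("et > 20", lambda c: c["et"] > 20.0),
    ("iso < 0.15", lambda c: c["iso"] < 0.15),
]
candidates = [
    {"et": 25.0, "iso": 0.05},
    {"et": 8.0, "iso": 0.05},
    {"et": 30.0, "iso": 0.40},
]
counts = run_cutflow(candidates, cuts)
print(dict(counts))
```

Histograms of the variables being cut on could be filled in the same loop, just before each predicate is evaluated.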
Simulation, Continued
- Tricky to recompile with user code
- less need if more standard output?
- e.g. # failing tool for successive cuts
- histos of variables being cut on
- Inconvenient to have RCP's inside an STP
- This is intrinsically more complex than RECO
- must select appropriate input files for data
- Better Documentation (less jargon?) needed
- Centralize (WEB), not in library OR in D0NEWS
Simulation Questions
- Needed RAW or STA (DST?) input
- no uDST;
- no "try L2 on this RECO object" (ill defined?)
- were enough data written out?
- were the simulations used in practice?
- Correlation matrices between L2 bits?
- Better Databases, "restore run", or both?
- Anyone use away from FNAL? needed to?
- Support on multiple platforms next time?
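On the question of correlation matrices between L2 bits, a minimal sketch of an overlap matrix built from per-event bit masks. It assumes each event's L2 decision is available as an integer mask, which is an illustrative choice and not the actual D0 data format.

```python
def overlap_matrix(event_masks, n_bits):
    """Return an n_bits x n_bits matrix of co-occurrence counts.

    matrix[i][j] is the number of events in which bits i and j both fired;
    the diagonal matrix[i][i] is the number of events in which bit i fired.
    """
    matrix = [[0] * n_bits for _ in range(n_bits)]
    for mask in event_masks:
        fired = [i for i in range(n_bits) if (mask >> i) & 1]
        for i in fired:
            for j in fired:
                matrix[i][j] += 1
    return matrix


def conditional_fraction(matrix, i, j):
    """Fraction of events firing bit i that also fired bit j."""
    if matrix[i][i] == 0:
        return 0.0
    return matrix[i][j] / matrix[i][i]


# Three toy events with three bits each.
matrix = overlap_matrix([0b011, 0b001, 0b110], n_bits=3)
print(matrix)
```

The off-diagonal entries show how much of one bit's rate is shared with another, which is the information needed to judge the unique bandwidth each filter adds.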
Online Verification (MONITOR Stream)
- Came late, but was useful
- Limited: in L2 can only find history bugs
- But they are hard to find without this
- Not all discrepancies were fixed
- Crashing events couldn't be dumped
- No Playback even if they had been--hard to debug
- "not important"--ELN Vs VMS runtime library?
- L1/L1.5 sim may be I/O (table) limited
- Some L1 Cal monitoring done late/never
Offline (Pre-Release) Verification
- Came later than desirable (manpower)
- Intended to test code already author-tested
- eventually came closer to this goal
- Vital to catch errors before release
- Hard to get up-to-date data samples
- do they test enough? (Coverage)
- Too much debugging by L2 experts
- bugs may affect online, simulator, or both
Code Development
- Lint, Code Review, Assertions next time
- More systematic testing before release!
- More standards?
- e.g. "reason failed" for candidates
- Weak point: communication between tools
- ESUM insufficient; tool outputs tricky to use
- again, tag by bit useful
- but don't try to outguess the frame
- Speed means tricky code (e.g. memory of previous processing)
Code Development, II
- ZEBRA banks did too little for the coder
- result was less user utility
- e.g. no D0X, ntuples for many L2 banks
- Hope for a better debugger next time
- ELN quirky (wrong source code, multi-D arrays)
- no DBANK, EZBANK
- how we paid for data structures hidden from compiler
- Reuse of offline code a 2-edged sword
- control becomes complex (L1, L2 in same .EXE)
- memory-hungry (paging next time? or too slow?)
Code Releases
- Must have Production release, version stamp
- with care, replaced alpha/beta releases
- Production release system was late
- Constants, trigger hardware not fully captured
- Database had no "releases"
- Code only from standard CMS
- Timing of releases inconsistent with RECO
- so not released from single production area
Framework
- Flexibility was sufficient
- most was used, except for "force pass", ordering
- Binary scripts and tool database a nuisance
- script download packing a nuisance
- How many bits?
- used 100 for 30 L1 bits, so 256?
- many were prescaled for background studies
- Not all bookkeeping went to databases
- Hard to monitor, simulate mixed farm
Framework: New Rules?
- Allow I/O?
- not allowing it was clear, but a pain
- databases? How to control traceability?
- enough memory to load all of FATMEN?
- Histogramming next time?
- memory, code, file gathering...
- Lots of work to support monitor stream
- processing time is a bad thing to cut on
Framework: Bells + Whistles?
- use something like objects, .PBD's to describe tools and
their hooks?
- begin, end run hooks (e.g. statistics, histos)?
- multiple node cross check during run??
- prescaling upgrades?
- floating point number factor (1.40)
- more preset values Vs dynamic bandwidth allocation?
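A non-integer prescale factor such as the 1.40 mentioned above could be implemented with a deterministic counter, as in the sketch below. This is only one possible scheme, not something the D0 framework is known to have used. The class name is hypothetical, and exact fractions are used to avoid floating-point drift over long runs.

```python
import math
from fractions import Fraction


class FractionalPrescaler:
    """Accept on average one event in every `factor` events (factor >= 1)."""

    def __init__(self, factor):
        # Going through str() turns 1.40 into exactly 7/5.
        self.factor = Fraction(str(factor))
        if self.factor < 1:
            raise ValueError("prescale factor must be >= 1")
        self.seen = 0
        self.accepted = 0

    def accept(self):
        """Return True if the current event should be kept."""
        self.seen += 1
        # Number of events that should have been accepted so far.
        target = math.floor(self.seen / self.factor)
        if self.accepted < target:
            self.accepted += 1
            return True
        return False


prescaler = FractionalPrescaler(1.40)
decisions = [prescaler.accept() for _ in range(14)]
print(sum(decisions))  # 14 / 1.40 = 10 events kept
```

A random prescale with acceptance probability 1/factor is an alternative that avoids any fixed pattern in the accepted events, at the cost of run-to-run statistical fluctuation in the kept fraction.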
Level 2 Libraries
- Segmentation was confusing
- Filter_util, d0filter, d0daq, L2control, vms_filter
- Level1, Level2, calor_filter, calor_util,
Summary
- Overall a success
- Absorbed a lot of resources
- Long lead time
- hope to preserve basic framework