From rhauser@fnal.gov Fri Apr 30 16:28:42 2004
Date: Tue, 20 Apr 2004 08:36:14 -0500 (CDT)
From: Reiner Hauser
To: d0dfwg@fnal.gov
Subject: Comments on analysis with tmb_trees

Hi,

Here are some comments on the analysis data format issue to be discussed in the Wednesday CSG meeting. I won't be able to attend, so I send this as input from somebody who used tmb_trees in an analysis and was also involved in producing them on a rather large scale for the NP group.

First I would like to mention that our decision to use tmb_trees for the analysis was based on two things alone: speed of development and speed of the running code. We would have happily used the D0 framework if it came close to working with tmb_trees in this respect. We accepted all the disadvantages that came with that (see below).

The build system as it is now is simply too slow for the fast turnaround you need in analysis (and also in other algorithm development, as I understand). Linking an executable that uses the basic data chunks still takes about 5 minutes on a reasonable machine. Even building a simple stand-alone program takes about 30 seconds, while the build system is doing things that probably only Paul or Alan understand, but which are clearly not required for our analysis code. This stands in contrast to about 10-15 seconds for compiling *and* linking our code with a simple Makefile, and about 1-2 minutes for recompiling everything.

I did not understand the remark in the ADM talk that tmb_trees are slow (compared to what?), because they are much faster than unpacking thumbnails. And we did not even apply any optimizations such as reading only certain branches of the event, because it never seemed necessary.

The analysis code is not run inside root or loaded dynamically into it. root is just too buggy and has to be restarted after any change anyway "just to be sure". Instead, the stand-alone executable is linked against various root libraries and essentially produces histograms.
Finally, root is used as a browser for the generated histograms, plus a few macros to manipulate and combine them.

As an example, when we decided to make various changes about 6 hours before the deadline of our note, we were able to re-run the whole analysis over our MC + skimmed down dataset in about 3 hours. This is a turn-around time which allows you to play around and explore/optimize your parameters.

The additional conversion step is obviously very inconvenient and was only made somewhat easier by the central production done by the NP group. However, that step is needed anyway when corrections are applied, and it is combined with the tmb_tree generation. The NP group created trees for 6 skims, and it took about 1 week real-time each for p14.x (x > 3) for pre-shutdown and post-shutdown data. For some skims this also corresponds to another filtering step (e.g. 1MUloose) which you would do in any case. As long as there is a d0correct, I don't see how you can avoid this (except by running all corrections every time you run your analysis), so the conversion to another format is just one additional step.

Another disadvantage is that it is not possible to re-run algorithms which are only available in the framework or in another root based format (example: vertexing). While we didn't need any of this, it probably would have been a show stopper if we had done b-tagging etc.

Some problems in the existing tmb_trees were found, but the author of that specific code had moved to another format and didn't intend to fix or even look into the problem before the conference rush was over.

tmb_tree based analyses have no common structured framework; the examples are usually pretty bad 'one big function' implementations. We basically added a 'TMBEvent' (which reads the next event and gives access to the data, and also acts as an intermediate store to pass results from one step to the next) and a little framework to run an arbitrary list of algorithms on each event.
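For readers who have not seen such a setup, the 'TMBEvent' plus algorithm list described above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the code actually used in the analysis: apart from the name TMBEvent, every class, method, key and number here (Algorithm, MuonPtCut, SelectedCounter, runAll, runDemo, "muon_pt", the pT values and the cut) is hypothetical, and the event "reads" from an in-memory list of numbers instead of a tmb_tree.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the 'TMBEvent' mentioned above. A real one would
// read the next entry of a tmb_tree; this one steps through fake muon pT values.
class TMBEvent {
public:
    explicit TMBEvent(std::vector<double> muonPt) : muonPt_(std::move(muonPt)) {}

    // Advance to the next event; returns false when the input is exhausted.
    bool next() {
        if (nextIndex_ >= muonPt_.size()) return false;
        currentPt_ = muonPt_[nextIndex_++];
        results_.clear();  // results of earlier steps belong to the previous event
        return true;
    }

    double leadingMuonPt() const { return currentPt_; }

    // Intermediate store used to pass results from one algorithm to the next.
    void put(const std::string& name, double value) { results_[name] = value; }
    bool has(const std::string& name) const { return results_.count(name) > 0; }
    double get(const std::string& name) const { return results_.at(name); }

private:
    std::vector<double> muonPt_;
    std::size_t nextIndex_ = 0;
    double currentPt_ = 0.0;
    std::map<std::string, double> results_;
};

// One analysis step. Returning false stops the chain for the current event.
class Algorithm {
public:
    virtual ~Algorithm() = default;
    virtual bool process(TMBEvent& event) = 0;
};

// Example step (assumed): keep events whose leading muon pT exceeds a cut.
class MuonPtCut : public Algorithm {
public:
    explicit MuonPtCut(double minPt) : minPt_(minPt) {}
    bool process(TMBEvent& event) override {
        event.put("muon_pt", event.leadingMuonPt());
        return event.leadingMuonPt() > minPt_;
    }
private:
    double minPt_;
};

// Example step (assumed): count events that reach this point in the chain.
class SelectedCounter : public Algorithm {
public:
    bool process(TMBEvent& event) override {
        if (event.has("muon_pt")) ++count_;
        return true;
    }
    int count() const { return count_; }
private:
    int count_ = 0;
};

// The 'little framework': run an arbitrary list of algorithms on each event.
void runAll(TMBEvent& event, const std::vector<Algorithm*>& algorithms) {
    while (event.next()) {
        for (Algorithm* alg : algorithms) {
            assert(alg != nullptr);
            if (!alg->process(event)) break;  // event rejected, go to the next one
        }
    }
}

// Small driver: four fake events, a 15 GeV cut, returns the number selected.
int runDemo() {
    std::vector<double> pts = {5.0, 25.0, 12.0, 40.0};
    TMBEvent event(pts);
    MuonPtCut cut(15.0);
    SelectedCounter counter;
    std::vector<Algorithm*> chain = {&cut, &counter};
    runAll(event, chain);
    return counter.count();
}
```

Keeping the per-event results inside the event object means that algorithms never need to know about each other; they only agree on the names of the values they exchange.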
There are so many parameters in an analysis that one needs a configuration mechanism instead of hard-coding things and changing the source code all the time. So we (re-)invented our own simple configuration file mechanism based on simple 'name, value' pairs.

As you can see, we saw the need to re-implement at least two pieces of the standard framework in a simplified way. While each was done in a few hours, it was of course also a waste of time, since there is always another little thing missing.

Despite claims to the contrary, analysis on tmb_trees can be done on a laptop without the D0 environment: it needs about 5 packages (basically tmb_tree + friends) and a custom Makefile. The environment (gcc, root) has different versions than any D0 release, but that doesn't matter for reading trees. Disk space, even on a laptop, is not a problem for a skimmed down data set (depending on your analysis, of course).

I would like to note again that, despite all these obvious problems, we never considered using thumbnails and the standard framework (which we have done in the past for various things).

cheers,
Reiner

--
Reiner Hauser
Email: rhauser@fnal.gov
Tel: (630) 840 8634
Fermilab, PO Box 500, MS 352, Batavia, IL 60510-0500