From mverzocc@fnal.gov Wed Apr 21 09:45:00 2004
Date: Tue, 20 Apr 2004 19:27:08 -0500
From: Marco Verzocchi
To: d0dfwg@fnal.gov
Cc: Terry Wyatt
Subject: Comments from the WZ group on data format issues

[ Part 2: "Attached Text" ]

Hi,

here are the replies from the WZ group to the questions circulated by the Common Data Format Working Group. Frederic can comment on WZAnalyze, and Herb already knows what my concerns are regarding further code/algorithm development which is not done inside the D0 framework (or at least ported to it). Please note that the one person who is using ROOT trees inside the WZ group also says that the code he's developing needs to be ported to the framework. I'd be so much happier to hear the same thing from other people doing their tiny ROOT thingy.

Cheers

Marco

Replies to your questions:

1. What analysis data formats and analysis tools are members of your group currently using?

Right now a mixture of data formats is in use in the WZ group. Some analyses (even well advanced ones) are performed exclusively inside the framework, using ROOT only for display purposes. Other analyses make extensive use of the WZAnalyze ROOT tuples (a simple, column-wise ROOT tuple), which are in part generated centrally (to minimize problems with wrong parameter settings) and in part generated by the users themselves. Some users create their own tuples, either starting directly from the thumbnails or from the WZAnalyze ROOT tuples. The WZAnalyze framework is easily expandable and many users have contributed to its development and/or extended the standard tuples for their own purposes. A few users have started to use TMBtrees.

There are no official analysis tools at the level of ROOT (although there are a couple of prototypes working at the level of the ROOT tuple) and it has long been the position of the WZ group that all official event selection code should be ported to the framework.
This has been a relatively slow process, but now the Z-->mm analysis is fully available in the framework, as well as part of the Z-->ee analysis. The W-->mv/ev analyses should become available in a short period of time (2-3 months). There is also a package for the selection of diboson events which works inside the framework.

MC should not be excluded from this review. The WZ group makes extensive use of the PMCS fast simulation, which for the moment uses its own ROOT tuple as its main format. The development of the code required to write thumbnails which can be transparently read by standard D0 framework software is delayed by the fact that there are not enough people with adequate expertise to complete and debug the creation of physics object chunks inside PMCS, and by the lack of an appropriate structure for storing the MC truth inside the thumbnails.

2. What analysis data formats or analysis tools does your group recommend to its members?

The only data analysis format which is recommended by the conveners of the WZ group is the thumbnail. We did not encourage the use of the WZAnalyze ROOT tuple. It was written mainly as a tool to help the migration away from the RecoAnalyze ROOT tuple and as example code to help users write their own small ROOT tuple tailored to their analysis. It turned out to be relatively successful as an analysis format given its ease of use, and as such it has remained in use for much longer than expected (or deemed appropriate) by the WZ group conveners. The fact that all the corrections and algorithms are applied before the creation of the WZAnalyze ROOT tuples has the advantage that the errors users could commit in applying high level algorithms (or calculating luminosities or rejecting bad runs) are reduced to a single point of failure.
Analysis code developed for the ROOT tuple is not easily ported back to the framework (the same is true for any other ROOT format) and in addition it is usually not stored and versioned properly in a central location. This has led to several problems for analyses which are conducted by a group of people. Many of these problems would be avoided by proper usage of CVS, but that is more difficult to impose on people for code which does not rely on the build mechanism.

The only analysis tool recommended by the group to the users is the D0 framework. There are examples of code (properly stored in CVS) for at least two analyses (Z-->mm and diboson event selections) and more will come.

To the knowledge of the conveners there are two attempts at writing analysis tools which work on the ROOT format. The first is a simple wrapper on top of the WZAnalyze ROOT tuple, to hide possible changes in the underlying format (addition of new variables; this is to avoid the usage of MakeClass....). The second is the creation of a framework for the proper compilation of ROOT macros, proper execution of an event loop and proper merging of histograms between different processes running in parallel. The second example (Jon Hays) is interesting and is in principle independent of the underlying data format.

3. Do you encourage or discourage people to use tmb_tree? Why or why not?

No, we don't encourage the usage of TMBtrees. We see the TMBtrees as too large, containing too many things which are not interesting for our analyses, not containing information which is useful/essential/crucial to our analyses, and lastly harder to use than the simple ROOT tuples. One thing that the TMBtrees authors seemed to forget was simplicity (in using it, in extending/reducing the scope of the classes) and user support.
There haven't been too many questions about TMBtrees in the various D0 mailing lists; that is either a reflection of how widespread the usage of the TMBtrees really is or of how easy to use they are....

4. How does your physics group support the efforts of analyzers? That is, does your group provide centrally managed data sets, tuples/trees, or analysis tools?

We provide central support for the WZAnalyze ROOT tuples, ensuring that a reference version is supported and used to prepare ROOT tuples which can be used by most people in the WZ group. But the usage of WZAnalyze is relatively simple and some users generate their own ROOT tuples starting from samples they have skimmed or re-reconstructed themselves.

5. Would your group benefit from the availability of common, possibly centrally produced root trees? What requirements would a common root format have to fulfill for your group to benefit?

Having a centrally produced and easy to use ROOT format would benefit our working group only under certain assumptions, and it could have very detrimental effects if things work out differently.

It could be beneficial if this central ROOT format is simple to use, is regularly updated with the latest corrections, and is centrally maintained and supported (user support is usually forgotten when deciding these issues). To be useful to the WZ group it should contain the full trigger information (at L1/L2/L3; I don't think this is yet the case).

A central ROOT format could be extremely detrimental to the WZ group if any further software development is conducted only or mostly at the ROOT level and if code is not designed from the beginning with the goal of inserting it into the D0 framework. Any software which is not designed with the goal of backporting it to the framework is relatively useless for some of the analyses to be made in the WZ group, which definitely require the usage of the full TMB++.
Any software development in the area of tracking/revertexing/energy calibration/missing Et calculation which is not available inside the D0 framework may be sufficient for one or two analyses, but we will need to have all of them available inside the framework. If this is not the case (and there are already examples of this) the WZ group will not be able to benefit from those developments and may have to repeat the work. For the W mass analysis we will need the ultimate performance from almost every piece of the detector, and that will require that the most recent versions of all high level algorithms are available inside the framework.

6. If tmb_tree were chosen as the basis for a common format, what changes would be required to make it attractive to your group?

The possibility of reducing the branches without recompilation. Extensive documentation and instructions for usage. A reduction in the usage of shared object libraries.

7. Does your group develop algorithms in root? Should algorithm development in root be encouraged? What is the best way to allow the entire collaboration to benefit from algorithms developed in root?

The answer to this question is NO F4ING WAY. The reasons have already been explained at point 5. Any algorithm development which is not done inside the framework, or which is not ported to the framework, is not useful for the collaboration and should be discouraged.

8. Is there any other information that you would like to bring to the attention of the Data Format Working Group?

The Data Format Working Group should also consider, in addition to the various ROOT tree formats, other alternatives:

* EDM ROOT: this is the simplest way of ensuring that code written in ROOT works exactly in the same way inside the D0 framework.

* Code development for the D0 framework is deemed unfeasible because of the slowness of the linking process.
Some work should be done (this will also benefit the EDM ROOT project) toward reducing the code dependencies (for example reducing the number of libraries required for linking analysis code from the current number - larger than 150 - to a limited subset - say 20 - without prejudice to the large majority of analyses). That and the usage of compilation/linking machines or the local installation of libraries (which would be very easy if we really could go down to 20 libraries) would help reduce the code development time. I personally believe that the second of these goals is not unachievable and could be obtained with only a small reduction of the physics capabilities of the code (hint: find a way of not having the full d0propagator inside the tracks and 70-80 libraries will disappear in a single step..........).

* There is also alternative number 3, but it's too late for that.

Comments from Heidi Schellman (some usage of TMB trees in luminosity ID group):
-------------------------------------------------------------------------------

We used the tmb_tree structure to do the lum efficiency and acceptance studies. It worked fine once we threw out the build methods and wrote our own.

I find the tree itself easy to use and well designed as a shadow of the thumbnail, but the nonstandard code needed to build and manipulate it is very very hard to work with. I basically built my own makefile so that I could do anything useful with it, as the scripts provided are hardwired and adding other libraries was very hard to do.

The tmb_tree infrastructure has to be cleaned up so that the underlying nice product can be used.

It also currently does not support calorimeter cell info - that isn't a big deal to add, I did it for EM.

Comments from Drew Alton (using mostly WZAnalyze ROOT tuple):
-------------------------------------------------------------

One thing we may want that no one else may want is the ability to read dst's.
That (of course) assumes that dst's still exist then.

Comments from Michiel Sanders (using mostly thumbnails):
--------------------------------------------------------

Some obvious comments:

* I would discourage anybody from developing general purpose analysis code in Root. That should all be done in the framework. Instead of spending time on writing Root stuff, people should focus on making framework code link faster!

* Personally, I will write most of my stuff in the framework. Eventually, I may write my own, private, small, dedicated root-tuple. I don't think I'll ever use some centrally produced root-tuple (because I don't know what goes into it!)

* At some point in the past, there was an effort to allow for Chunk access from Root directly (Marc Paterno?). What happened to that? And if that works, why do we then need a "tmb_tree" or equivalent? (this is a question to the committee, not to you personally)

Comments from Andrew Askew (using mostly thumbnails and WZAnalyze ROOT tuples):
-------------------------------------------------------------------------------

I recommend nothing. Given that there are so many options, the path of least resistance has been to let people just sort it out. The easiest way to start an argument is to tell someone what they must use.

About TMB trees: I personally don't like them (though I wouldn't DISSUADE people, see #2). The concept I have of a ROOT based data object is:

1.) Simplicity: Variables should be easily traceable back to their TMB equivalents.

2.) Expediency: No ADDITIONAL overhead should be incurred by using ROOT files (this is the argument against TMBTrees: additional headers and .so are required, as well as a release that has them. Then there is more version control needed).

3.) Versatility: Basic quantities as well as derived ones should be available so that algorithm testing may be done quickly and then transferred back into the framework for mass production.
About central production of ROOT-based files: Seems something like a return to the dark ages of reco_analyze. To have a single mass produced root file one would need to make certain that the quantities that all analyses need are present, which is a MASSIVE task that almost certainly will not be accomplished. I suspect the result would be a momentary condensation of analyses on the commonly produced tuples, along with hordes of complainers, and then a Balkanization of tools again as people try to make the information they need available.

Comments from Yuri Maravin (used thumbnails, WZAnalyze ROOT tuples, moving to TMB trees):
-----------------------------------------------------------------------------------------

I encourage the use of tmb_trees since at the time when I switched to them, they had a better representation of the TMB chunk. More people used TMBTrees, so the chances of running into bugs are lower. Also, Root C code is more readable in TMBTrees than in WZAnalyze.

I think it would benefit the whole collaboration to have a common root format (for example it would speed up our pi0 research if we had B-physics vertexing available). I would prefer to have a root format that has a full representation of the TMB info.

All photon ID algorithms that I develop now will be ported to emreco and included in TMBs (similarly to hits on the road to EM object).