From gwatts@phys.washington.edu Thu Apr 29 16:58:27 2004
Date: Wed, 14 Apr 2004 09:37:59 -0700
From: Gordon Watts
To: Aurelio_Juste, greenlee@fnal.gov, d0dfwg@fnal.gov
Cc: quadt@fnal.gov
Subject: RE: Common root-based data format.

Hi Aurelio,

Thanks for all the answers! What do you mean by not being DZERO specific? For example, things like TMB Tree and other things can be built outside the DZERO framework. Does that make them non-dzero specific? Do you call the top_tree non-dzero specific?

Cheers,
Gordon.

-----Original Message-----
From: Aurelio_Juste [mailto:juste@fnal.gov]
Sent: Wednesday, April 14, 2004 6:31 AM
To: greenlee@fnal.gov; d0dfwg@fnal.gov
Cc: quadt@fnal.gov
Subject: Re: Common root-based data format.

Dear Herb and D0DFWG members,

please find below our replies to your questions:

> 1. What analysis data formats and analysis tools are members of your
> group currently using?

top_analyze is used to produce top_trees, which is the common root-based data format used in the Top Group. Some people analyse top_trees in the makeclass style of ROOT, others use the top_tree reader. Part of the single-top group puts additional framework and software tools on top of that.

>
> 2. What analysis data formats or analysis tools does your group
> recommend to its members?

For the sake of uniformity among all top analyses, we require everybody to use top_trees, either using makeclass or the top_tree reader to analyze them.

>
> 3. Do you encourage or discourage people to use tmb_tree? Why or why
> not?

We have discouraged the use of tmb_trees within the Top Group. The reason is largely historical. The official data format at D0 has been TMBs, not tmb_trees.
Top_analyze was developed to address the issue of centralized root-tuple production ensuring the use of corrections to objects (well before the development of d0correct), calculation of topological variables, kinematic fitting, etc., in a completely uniform way within the Top Group, as well as to provide fast feedback, immediate bug-fixes and 24/7 support (by a small group of experts).

>
> 4. How does your physics group support the efforts of analyzers?
> That is, does your group provide centrally managed data sets,
> tuples/trees, or analysis tools?

Yes, top_trees have been centrally produced for the winter conferences, both for data and MC. All the tools inside top_analyze are common. Many other tools at the root level (e.g. a package to compute trigger efficiencies, etc.) are also made common to the whole group. We try to make the root-level common tools independent of the analysis framework. In most cases a small group of developers implements the code and maintains/supports it.

>
> 5. Would your group benefit from the availability of common, possibly
> centrally produced root trees? What requirements would a common root
> format have to fulfill for your group to benefit?

We would certainly benefit as, based on our experience, this is something that requires a significant amount of time and effort.

In order to REALLY benefit, we need to be able to do everything from the centrally produced root trees: e.g. computing any additional variables needed and adding them to the trees, being able to do very easy and fast skimming based on objects, etc.

There is also the concern of having enough information: e.g. in the past having the CalDataChunk, full trigger information, etc., available in the top_tree has been crucial. Have the size issues related to the tmb_trees been resolved?

The interface to objects should be made as close as possible to the TMB, so that framework algorithms could be run: e.g. jet finding.
As soon as we need to rerun tmb_tree production ourselves, there is really not much advantage for us regarding the common data format, except for the important fact that code could then be shared with the rest of the collaboration.

An additional concern is maintenance: with top_analyze right now we have 24/7 coverage. If there is a problem and we need to produce a new tag and start rerunning top_tree production, we can do it almost immediately.

Proper documentation is non-negotiable for something that is supposed to be a D0-wide data format.

What we have learned from Moriond'04, where we provided centrally produced ROOT tuples for data and MC for the first time, are the limitations of the system. Storing the files on big disks (/rooms/... on clued0) means we are disk space limited. Storing them in SAM means that during hot conference phases we are IO/SAM station limited (at least this time, when the CSG skimming and TMB fixing were partially running in parallel on CAB). It is worthwhile re-thinking the data handling model at the same time, as the two issues are strongly coupled.

>
> 6. If tmb_tree were chosen as the basis for a common format, what
> changes would be required to make it attractive to your group?
>

See reply to 5).

> 7. Does your group develop algorithms in root? Should algorithm
> development in root be encouraged? What is the best way to allow the
> entire collaboration to benefit from algorithms developed in root?

Current development of algorithms in root within the Top Group is not independent of the analysis framework; such independence is something we would really like to see. It would be highly desirable to make it such that it can be imported efficiently into the framework. The best way is by having the root-based data format have the same interface as the TMB.

> 8. Is there any other information that you would like to bring to the
> attention of the Data Format Working Group?
Please give proper consideration to the long term needs of the experiment and ask yourselves (and/or the experts): are tmb_trees really the ultimate data format that D0 needs to do physics on datasets in the fb^-1 range? We can make this one change in data format but that's really the last one. We cannot afford to keep changing every 2 years.

Since it is crucial to have full maintenance and software support for a data structure and the corresponding analysis framework, we wonder

- if the way to go is EDMROOT (a recommendation/implementation by the D0 computing experts). If that's not the way to go, why not? What is the timescale for this project? Will we have to change data format again, and when?

- this analysis structure needs really fast turnaround time when it comes to debugging. So software release cycles are too slow.

- please do not recommend/develop something D0-specific. Data storage or packing/unpacking is one thing; for data analysis a number of packages have been developed, see for example http://pax.home.cern.ch/pax/paxguide/index.html which people are already using in the context of CMS analysis preparation (for questions please contact Martin.Erdmann@cern.ch).

- please make sure the software can be easily used at external sites, on small desktops and laptops without the need to install huge pieces of D0-software. A small tar ball might just be acceptable.

We very much hope this helps.

Thanks a lot,
Arnulf and Aurelio