From gwatts@phys.washington.edu Wed May 5 08:29:18 2004
Date: Tue, 04 May 2004 22:51:02 -0700
From: Gordon Watts
To: Suyong Choi, d0dfwg@fnal.gov
Subject: RE: summary of answers to Dfwg survey - Higgs group

Hi Suyong,

Thanks for the detailed comments. I don't think anyone else has replied,
and while a number of the issues are common ones that have been raised
before, there are some new ones. Feel free to complain if I've missed the
gist of something, of course.

>It contains most of the tmb_tree content in a few kilobytes/event.

Ouch. This will be very difficult. Later on you make the comparison with
the TMB. I think it would be interesting to see exactly where that extra
data is going, but... there are two trade-offs involved with that 44 bytes
(vs the 272). The first is speed of unpacking. You can see this in some of
the quick tests we've done reading TMBs and reading tmb_tree (20 Hz vs
300 Hz). The second is ease of use if someone wants to use a
MakeClass-style analysis. If things are tightly packed, they will need
common code available to unpack them. The best place to put that is hidden
in the objects written to the root tuple.

On the other hand, there may be redundant info in that 272 bytes. For
example, some of the information stored in TMBTrks is things like
"closest vertex", etc. (a TRef). That could be calculated on the fly.
Again, the speed vs data size trade-off.

BTW, how did you calculate that 272? Are 44 and 272 on equal footing? For
example, I think both root and TMB run gzip on their output data. It could
well be that TMB gets less of a compression factor than root. Assuming the
272 is uncompressed, the Trks->Dump() says the TMBTrks branch gets a
compression factor of 2.1, so on disk this is about 130 bytes. Still large
compared to the 44. BTW, you can turn on higher compression for root (it
doesn't use the maximal compression setting for the root trees).
I don't think it would make more than a 10% difference, but it isn't the
default, which has been optimized for the trade-off between file size and
CPU time.

>3. It can be read fast. Quantities that are
> computationally intensive to compute should be calculated on demand
> rather than in streamers.

This implies a move towards methods of the objects that are read back and
away from MakeClass, or to utility routines, or making the root-tuple
larger (e.g. btagging information).

>Due to the slowness of working in d0 environment (linking, running, and
>debugging),
>algorithm development outside the framework is unavoidable.
>However, algorithm development (using ROOT) should be done carefully,
>especially the design of classes and packages, with assistance from
>true software experts to make it simple and portable.
>It can be written so that it is not tied to any specific format.

You are one of the people who are responsible for putting together reco.
Have you given some thought to how one might solve this problem? If so,
could you share any ideas you've had? :-)

Cheers,
Gordon.

-----Original Message-----
From: Suyong Choi [mailto:suyong@fnal.gov]
Sent: Monday, May 03, 2004 10:12 AM
To: d0dfwg@fnal.gov
Subject: summary of answers to Dfwg survey - Higgs group

Hi,

Here is the summary from the Higgs group.

Regards
Suyong

1. What analysis data formats and analysis tools are members of your group
   currently using?

> The Higgs group uses various formats: the Athena, higgs_skim,
  higgs_multijet, tmb_tree and top_tree tuple makers, all with d0correct
  applied. Except for tmb_tree, the others are non-object-format
  root-tuples.

2. What analysis data formats or analysis tools does your group recommend
   to its members?

> We don't make recommendations. Subgroup leaders may suggest some format
  for which they already have analysis code ready. Analyzers are
  encouraged to check their results against those obtained by others
  using different formats.

3. Do you encourage or discourage people to use tmb_tree? Why or why not?
> We do not encourage or discourage tmb_trees. This is mostly a matter of
  personal preference. Some people don't like to use objects and/or find
  them cumbersome to use. Other formats are smaller, faster, easier to
  modify, and easy to analyze both at the root command line and in
  standalone programs.

4. How does your physics group support the efforts of analyzers? That is,
   does your group provide centrally managed data sets, tuples/trees, or
   analysis tools?

> We use the Common Sample Group's skims. Each subgroup makes the tuples.
  Datasets and analysis tools are provided for the Athena and higgs_skim
  formats.

5. Would your group benefit from the availability of common, possibly
   centrally produced root trees? What requirements would a common root
   format have to fulfill for your group to benefit?

> We would certainly benefit from centrally produced tuples, eliminating
  the need for us to support our own format and generate our own samples.
  The requirements are:

  1. It contains most of the tmb_tree content in a few kilobytes/event.
  2. A standalone program to analyze the format can be linked within a
     few seconds or less. In other words, it shouldn't depend on a huge
     amount of code and the d0 environment.
  3. It can be read fast. Quantities that are computationally intensive
     to compute should be calculated on demand rather than in streamers.
  4. It should be easy to strip events, trim branches, and add
     user-specific branches geared toward a particular analysis without
     writing a new class.
  5. It is probably a good idea to keep the common tuple in the SAM
     system so that access to the tuple is consistent with other data
     sets and is also available remotely.

  If it's too big or slow, or it takes forever to link, we'll want to
  continue making the current root tuples and the benefit will be lost.

6. If tmb_tree were chosen as the basis for a common format, what changes
   would be required to make it attractive to your group?
> At a minimum, clear documentation of all the methods should be
  available without too much navigating. Also, it should be a lot
  smaller. The tmb_tree takes about 20kB/event, much of which is
  redundant. The tmb_tree track object, for example, uses 272 bytes/track
  while the tmb uses 44 bytes/track. Other roottuple formats fit
  essentially the same information into 3.5kB/event and could be made
  still smaller. The small format allows large data and MC samples
  (including the complete 1EMloose, 1MUloose, and QCD Moriond skims) to
  be kept on a single workstation. This speeds up the analysis cycle.

7. Does your group develop algorithms in root? Should algorithm
   development in root be encouraged? What is the best way to allow the
   entire collaboration to benefit from algorithms developed in root?

> We currently do not develop algorithms in ROOT. That being said, the
  major improvements to physics in the past couple of years came from
  algorithms developed and optimized outside the d0 framework
  environment, e.g. tracking and b-tagging. Due to the slowness of
  working in the d0 environment (linking, running, and debugging),
  algorithm development outside the framework is unavoidable. However,
  algorithm development (using ROOT) should be done carefully, especially
  the design of classes and packages, with assistance from true software
  experts to make it simple and portable. It can be written so that it is
  not tied to any specific format.

8. Is there any other information that you would like to bring to the
   attention of the Data Format Working Group?

> The root-tuple maker should directly use existing D0 packages/code,
  e.g. d0correct, metreco,... rather than re-coding these in the
  root-tuple maker itself, to reduce the chance of mistakes.
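
The size figures traded in this thread can be put side by side with a
little arithmetic. The sketch below is only a back-of-the-envelope
illustration in Python: the per-track and per-event numbers and the 2.1
compression factor are the ones quoted above, while DISK_GB (a workstation
disk size) and the function names are hypothetical choices made for the
example. It also assumes, as the reply does, that the 272 bytes/track is
an uncompressed size; whether the 44 and 272 are on an equal footing
remains an open question.

```python
# Back-of-the-envelope size arithmetic using figures quoted in the thread.
# DISK_GB and the helper names are hypothetical, chosen for illustration.

TMB_TREE_TRACK_BYTES = 272    # tmb_tree track object, bytes/track (quoted)
TMB_TRACK_BYTES = 44          # TMB track, bytes/track (quoted)
TMBTRKS_COMPRESSION = 2.1     # compression factor reported for TMBTrks

TMB_TREE_EVENT_KB = 20.0      # tmb_tree, kB/event (quoted)
SMALL_TUPLE_EVENT_KB = 3.5    # other root-tuple formats, kB/event (quoted)

DISK_GB = 200.0               # ASSUMPTION: not a number from the thread


def on_disk_bytes(uncompressed_bytes, compression_factor):
    """Size on disk, assuming the input size is uncompressed."""
    return uncompressed_bytes / compression_factor


def events_per_disk(disk_gb, event_kb):
    """Whole events that fit on a disk, taking 1 GB = 1e6 kB."""
    return int(disk_gb * 1e6 / event_kb)


if __name__ == "__main__":
    track = on_disk_bytes(TMB_TREE_TRACK_BYTES, TMBTRKS_COMPRESSION)
    print(f"tmb_tree track on disk: {track:.1f} bytes "
          f"(TMB track: {TMB_TRACK_BYTES} bytes)")

    big = events_per_disk(DISK_GB, TMB_TREE_EVENT_KB)
    small = events_per_disk(DISK_GB, SMALL_TUPLE_EVENT_KB)
    print(f"events on a {DISK_GB:.0f} GB disk: "
          f"{big} (tmb_tree) vs {small} (small tuple)")
    print(f"ratio: {small / big:.2f}")
```

With the quoted inputs, the compressed tmb_tree track comes out at roughly
130 bytes, still about three times the 44 bytes of the TMB track, and a
3.5 kB/event format holds about 5.7 times as many events as a 20 kB/event
one on the same disk, whatever disk size is assumed.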