From melnit@fnal.gov Mon May 3 14:46:18 2004
Date: Wed, 28 Apr 2004 18:13:14 -0500
From: Alex Melnitchouk
To: Gordon Watts, d0dfwg@fnal.gov
Cc: melnit@d0mino.fnal.gov
Subject: Re: more comments on the future data format

Hi Gordon,

I appreciate that you got into details and provided these explanations as well as sharing your opinion on the root/tmb issue.

Regards,
Alex

Hi Alex,

Thanks for the responses.

- I think it is inevitable that algorithms will be developed based on this root based format. The problem is porting them back to the framework. If they are simple, then it is easy. If they are complex, then it is hard. An example of the former is a simple muon based tagging algorithm. This does some simple cuts on the muon, a delta-R test w.r.t. the jet, and it is done. An example of the latter is secondary vertex tagging. This is 1000's of lines of code. People are unwilling to do development work in the framework because of the long time it takes to do it. Then you are left with having two versions of the code which must track each other -- not too pleasant. I (personally) don't see this as a black & white issue; I'm certainly not speaking for all of the dfwg on this issue.

- I don't remember your comment on good LBN's. I'm also puzzled by it -- in the single top analysis, for example, we use a root-based analysis and do the lbn selection in that analysis. We accomplish this by running a fairly complex series of jobs to extract a text file with the list of bad lbns in it (we don't know of a way around this yet). That text file is then read into a small object which can test each event. See the top_dq cvs package for an example (look in lists/analysis-lists to see the raw files that are fed to the analysis code).

- Serban is correct.
I've been running speed tests, and last night I generated a simple analysis using MakeClass that plotted jet pt's. I also did the same, but loaded in objects as well. Speeds are the same within the resolution of the timer I used (actually, the load on my CPU).

- Under the hood, an object based root-tuple and a non-object based root-tuple are actually the same thing! Just in one case you are using objects that have been compiled into root and in the other case the objects either have to be generated on the fly (open a tmb_tree file w/out loading up the shared library first) or they have to be compiled in (load up the tmb_tree shared objects). Root maps objects onto the branch-leaf structure (and back). The streamer functions are partially responsible for this. The speed up factor depends entirely on the tree's split-level. That is -- does the leaf just contain the jet pt or does it contain all the info for the jet? The more finely grained, the less you need to read in to do something. However, the more finely grained, the more you have to be careful to read in everything you need (i.e. unlike PAW, root does not yet automatically load the variables you need).

- I think we've all agreed we like the FAQ idea. But our convener has been distracted by a talk they need to give, and so hasn't had time to start it yet.

Cheers,
Gordon.

-----Original Message-----
From: Alex Melnitchouk [mailto:melnit@fnal.gov]
Sent: Saturday, April 24, 2004 2:40 PM
To: Gordon Watts; d0dfwg@fnal.gov
Cc: melnit@d0mino.fnal.gov
Subject: more comments on the future data format

Hi Gordon,

Thanks for your reply! I have a few things to add:
-- on issues that came up during this week
-- following up on what you wrote in reply to my previous email

There was a question during the CSG meeting discussion about using this future root-based data format also for software other than analysis software. I would be against this idea.
I think all "pre-analysis" software such as reco and d0correct should be done on tmbs, in the framework, organized as it is now according to releases / package versions, meeting D0 standards. The analysis software, on the other hand, should be decoupled from the framework, and written in the way a user wishes to write it. The data sample and ROOT should be the only two things needed in order to do analysis. Users may want to choose to do this in the D0 environment (say to get some extra functionality from standard libraries or to use the compiler and/or the ROOT version that comes with it) but this should be an option, not a requirement.

I'm sorry that I'll now be repeating the same thing that I was saying earlier (without saying anything new really). Since we need to select good lbns for analysis, the current way of doing it (the framework-dependent way) should also be changed, i.e. decoupled from the framework somehow, if we want to do analysis in the framework-decoupled way. I understand that luminosity software is not the direct concern of this working group. On the other hand, when talking about a data format for analysis and considering using it independently from the framework, we do bump into having to think about the good-lbn selecting software too. Or is my picture more complicated than the real situation?

Concerning object-based vs flat format:

There is something that Serban pointed out to me recently which I'd like to clarify: tmb_trees really fall into both categories, i.e. a user can use them in two ways:
i) as regular trees, i.e. reading branches and accessing individual variables in them without having to use any classes.
ii) load libraries that contain tmb_tree classes and access info via objects of TMBEmcl, TMBJet etc. classes, which is currently the most popular way of doing it.
Is this correct?

On your reply to my previous email:

Gordon:
---------------------------
BTW, I suspect it is best to switch off formatted email when sending email to a large group of people; many people use pine and, being stuck in the 1950's, can't read formatted email as the rest of us do. :-)
---------------------------

Thanks for pointing this out. This one should be plain text. Please complain if not. BTW, suggesting that some people in *this* group may be stuck in the 1950's might open quite a bit of room for discouragement :-), but I will not go there!

Gordon:
---------------------------
I do agree that using the compiler is just better in the end!
---------------------------

Great!

Gordon:
---------------------------
I suppose there are two issues here. One is choosing a data format, the second is how you support it and what extra samples and features you have. To a large extent, as long as you stick to root, I suspect these two are, for the most part, independent.
---------------------------

Yes, I agree that there are two issues, and I guess at this point you guys are more interested in having input on the first issue. However, I think they are closely tied, e.g. an ntuple format and a root-chunks format would require a very different level/amount of support, I think, so maybe it would make sense to think of the whole thing even at this point.

Gordon:
---------------------------
4. I've recently seen a rather cool idea -- you can build a version of root with extra classes linked in. One could then make a new version of root that had the root-data-format-objects linked in. Type "d0root" or similar, and you'd have TMBJet (or whatever).
---------------------------

Great! This option would definitely be very convenient for a user in the case of a tmb_tree-like format.

Gordon:
---------------------------
5. Depending on how you do your analysis this is either easy or it isn't. If you are totally object based (as are the tmb trees for the most part), this isn't too hard.
In fact, you can write code that has _no_ branch names, and then ask for particular TClonesArray's or single objects from the tree at runtime. Root is flexible enough to only read them in when you ask for them. I've written code for one of my frameworks that does this already, so I know it can be done. However, while you get a good speedup as long as you don't need tracks, it is some pretty hairy coding.
---------------------------

I see. Thanks for the explanation. What about a non-object based format? Say 2-level trees (branches and leaves): if we just turn off reading specific branches, would the speed-up factor in this case be smaller than in the example that involves TClonesArrays which you describe?

Gordon:
---------------------------
6. I think you are talking about a FAQ or a better search engine for the d0rug mailing list here. :-)
---------------------------

More about a FAQ or a web-page organized e.g. like this one
http://www-d0.fnal.gov/phys_id/emid/d0_private/EM_Particle_Documentation_EMID.html
and/or like this one
http://www-d0.fnal.gov/phys_id/bid/d0_private/doc/use_tmb_bcjet/index.html
which would contain the answers to many questions which in the absence of such a web page would be addressed to d0rug (like description of variables and how to access them). As for searching through d0rug archives, e.g., to find info about a certain type of crash, etc., I think it works fine now.

Regards,
Alex

________________________________________
From: Alex Melnitchouk [mailto:melnit@fnal.gov]
Sent: Sunday, April 18, 2004 8:50 PM
To: d0dfwg@fnal.gov
Cc: Alex Melnitchouk
Subject: comments on the future data format

Dear data format working group,

I have 7 comments:

1. First of all -- JUST ONE ROOT BASED format is a great idea!

2. I have been doing tmb_tree based analysis and have been very pleased both with this particular format and with the help I was receiving from Elemer, Eric, and Serban whenever I needed it.
I wouldn't mind if the new format were similar to this one.

3. If it is indeed going to be something similar to tmb_trees, I would express an opinion that, even though it is ROOT based, its usage (and then official code examples / support) does not need to be interpreter-oriented. I prefer to use it with a compiler and would like the support/documentation to be compile-mode oriented instead. So far I have been borrowing some *unofficial* examples from some users and then passing those to other users. It was not a big deal but I thought why not have it as something official/standard (vs. made and distributed by individual users) since many analyzers -- to the best of my knowledge -- are working in the compile mode anyway. The benefits will be increasing as datasets grow too. Besides, if this approach would allow building in language requirements more strict than those that ROOT generally imposes -- I think this would definitely be a big plus and will eventually pay off in better understanding of our data.

4. I am certainly not that familiar with the tmb_tree internals and peripherals, but as a user I would suggest the following simplification for the stage of preparation of the common shared library (in case the tmb_tree-like approach will be followed for the data format to be decided on):

instead of this:
------------------------------------------------------------------------
1). check with which tmb_tree and tmb_analyze package versions the d0correct executable that produced the tmb_tree file to be analyzed was built.
2). create a release area
3). addpkg tmb_tree and tmb_analyze of those specific versions
4). ln -s tmb_analyze/macros macros
5). build shared library as instructed in tmb_analyze/macros/README.txt
6). do analysis
7). in case a new d0correct version comes up -- repeat steps 1) through 6)
------------------------------------------------------------------------

have this:
------------------------------------------------------------------------
1). copy (or just include) an already existing shared library (built centrally, just once, with proper versions) to my working area, which does not have to be a release area in this case
2). do analysis
3). in case a new d0correct version comes up -- repeat steps 1) through 2)
------------------------------------------------------------------------

There is a stretch in mentioning not having to create a release area: one eventually would still need the luminosity package and, consequently, a release area. On the other hand, if the luminosity software (I'm talking specifically about the software that all users are using in physics analyses to identify good lbns) could possibly be decoupled from the framework too (and even encouraged to be decoupled by the nature/properties of the future data format to be decided on), it would be quite advantageous, I think.

5. Turning on/off specific branches to read: having this option available, user-friendly, and explained in the documentation in a way that would be transparent to any novice -- would be great.

6. It would be more efficient if basic questions that many users may have about the tuples/trees (and using them) were answered in advance on a webpage and naturally taken off the d0rug and/or private user-author email exchanges, e.g.
-- in which coordinate system a particular variable was calculated, or
-- what the pt threshold was when counting clusters, or
-- if tuples/trees were produced with a certain release version, in which release one should analyze them.
I hope that, since it would be a common D0 format, it will be easier to maintain good up-to-date documentation.

7. I'll be looking forward to seeing this new format and doing analysis with it!

Regards,
Alex
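
The framework-independent bad-lbn test described in this thread (a text file with the list of bad lbns, read into a small object which can test each event) could look roughly like the C++ sketch below. It uses only the standard library. The class name BadLbnList and the one-lbn-per-line file layout are assumptions made for illustration; this is not the code in the top_dq package.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper, not the top_dq implementation: holds the bad
// luminosity block numbers (lbns) read from a plain text list.
class BadLbnList {
public:
    // Assumed layout: one lbn per line. Lines that do not start with a
    // number (blank lines, '#' comments) are skipped.
    explicit BadLbnList(std::istream& in) {
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            unsigned long lbn = 0;
            if (fields >> lbn) {
                bad_.insert(lbn);
            }
        }
    }

    // True if the lbn is not in the bad list.
    bool isGood(unsigned long lbn) const { return bad_.count(lbn) == 0; }

    // Number of distinct bad lbns that were read.
    std::size_t size() const { return bad_.size(); }

private:
    std::set<unsigned long> bad_;
};
```

In an event loop one would build the object once from the list file (for example through a std::ifstream) and skip every event for which isGood(lbn) returns false, where lbn is whatever lbn value the tree stores for that event.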