Recently, there has been some confusion about what we are trying to do with streaming. Here are some points on that...

Why stream?

1) We must be *very* careful in explaining that when we say streaming, we mean streaming the RAW, DST and thumbnail main physics data. We do not mean making a derived data set ("skimming", "stripping" or "tapping-out"). The parent data of a tap-out or skim would be the streamed data.
2) Although the thumbnails will be streamed (because the raw data are streamed and this propagates to all data tiers), streaming is more important for the data tiers on tape (DST and RAW), as I'm assuming that there will be some kind of virtual streaming (like D0dad in run 1) for the thumbnails. Having the thumbnails streamed does, though, make it easier to create private run-event lists for a particular analysis.
3) There will be too many tapes to reprocess an "ALL" stream if the need arises. Streaming is then necessary so that we can reprocess only the most affected data.
4) Streaming the raw data allows for prioritized and/or private reconstruction. For example, perhaps an off-site Remote Analysis Center will process b physics data (running more tracking algorithms than the default reco). Having the majority of the b data in their own streams would make this task much easier.
5) Picking events is more efficient if the raw/DST data are streamed. If you are picking events for your analysis and assume that those events are randomly sprinkled throughout the data, then having the data streamed means you'll have fewer tape mounts (your data will be on fewer tapes).

---------------------------------------------------------------------
Why stream online?

1) Online streaming propagates to all data tiers. You get DST and thumbnail streaming for free.
2) Only online streaming allows for prioritized RECO (see item 4 above). Reprocessing from the RAW data can only be done if the data are streamed online.
3) Only online streaming makes picking RAW data more efficient (see item 5 under "Why stream?").
4) Online streaming is insensitive to reco-specific problems. Of course streaming is still sensitive to trigger problems, but then again everyone is sensitive to trigger problems.

------------------------------------------------------------------------
Why stream based on L3 triggered physics objects (why not stream by trigger bit)?

1) Streaming by L3 triggered physics objects is nearly equivalent to streaming by trigger bit. Groups of triggers are categorized by the type of physics objects they require (e.g. "highPtElectron"). The event is streamed based on the categories it passes. This system is easier to manage than streaming by trigger bit - the streaming scheme does not need to change for minor changes to the trigger list. The categorization is actually done by looking at the filters in passing L3 filter scripts, but this can be overridden from the trigger list (for example, QCD-gap triggers use the regular L3 jet filters, but we can still split them off from the ordinary jet events with overrides). Of course, major changes to the trigger list might require rethinking the stream scheme.

---------------------------------------------------------------
Why exclusive streaming instead of inclusive streaming?

1) We can't afford the tapes to make duplicate copies of events, and we would not have the online throughput for a high duplication rate. (The run 1 duplication rate was about 50%.) Though we can't do inclusive streaming online, we can make some small inclusive streams offline.
2) The Reco farms do not have the capacity to process inclusive streams with a high duplication rate.

---------------------------------------------------------------
What are some potential disadvantages of this plan?

1) Exclusive streaming means events for your triggers may go to more than one stream.
The analysis tools group plans to make the streaming transparent to users -- a tool will figure out what streams your events went to and return you a list of files. You will not have to negotiate the stream scheme yourself.
2) Exclusive streaming means that most of your events can go to one (or a few) streams, but a small number may go to other streams. This problem would affect precision analyses the most (like W/Z). We are trying to design a stream scheme so that such analyses would not have this problem. Other analyses may choose to use only their large streams, ignoring the small fraction in others.
3) Sensitivity to missing data: If events for your analysis go to streams A and B and stream B has some missing files (say a raw tape goes bad and the events can't be reco'd -- this seems to be rare these days though), you must eliminate all of the missing luminosity blocks from your data. That is, if events in stream A come in during those luminosity blocks, you must ignore them to get an accurate luminosity. If stream B is small (has rare events), losing a file may mean losing many luminosity blocks. Our goal is to keep the streams large enough that this problem is minimized.
4) Luminosity is harder to calculate. With triggers going to more than one stream, bad luminosity blocks must be handled much more carefully (see item 3). Small streams are more sensitive to data losses, and careful cross checks may be needed to verify stream integrity.

So the gist is this: if we want to do streaming, it has to be exclusive. Doing the streaming online allows more reprocessing than offline streaming does. Though there are disadvantages, we can hopefully minimize them and still get the advantages of streaming the data.

I hope this clears up some misconceptions. Let me know if you have questions or comments.

-- Adam
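
As a rough illustration of the luminosity-block bookkeeping described under the disadvantages (missing data in one stream forces you to drop those blocks from every stream you read), here is a minimal Python sketch. It is not taken from any existing tool: the function names, the stream labels "A" and "B", the block numbers and the event layout are all made up for illustration.

```python
def usable_lumi_blocks(available_by_stream):
    """Return the luminosity blocks that are present in every stream read.

    available_by_stream maps a (hypothetical) stream name to the set of
    luminosity blocks for which that stream's files all exist.
    """
    block_sets = [set(blocks) for blocks in available_by_stream.values()]
    if not block_sets:
        return set()
    return set.intersection(*block_sets)


def select_events(events, usable_blocks):
    """Keep only events recorded during a usable luminosity block."""
    return [ev for ev in events if ev["lumi_block"] in usable_blocks]


# Assumed example: stream B lost the file holding blocks 3 and 4.
available = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 5},
}
usable = usable_lumi_blocks(available)

events = [
    {"stream": "A", "lumi_block": 2},
    {"stream": "A", "lumi_block": 3},  # ignored: block 3 is missing from B
    {"stream": "B", "lumi_block": 5},
]
kept = select_events(events, usable)
```

The luminosity for the analysis would then be summed over the blocks in `usable` only. The smaller stream B is, the more blocks a single lost file can remove from that set.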