Event data interface

David Adams
12 May 1997 1940
[http://www.bonner.rice.edu/adams/event]


Introduction

The bulk of the data in HEP is "event data", i.e. that which can be uniquely associated with a particular event. This event data includes the raw data read from the detector and the data obtained by processing this raw data. In an OO world it is natural to identify the event as a class whose primary responsibility is to manage this data.

Here we present an OO model for the event. The data in the event is organized into collections of objects of common type (e.g. tracks, jets or electrons). Here we call these collections "chunks". Each chunk is responsible for managing a clearly defined set of data which is loosely coupled with the data in other chunks. Typically an event is created with raw data and then a series of reconstruction algorithms is applied. Each chunk contains a reference to a generator (aka a reconstructor) which is used to generate its data.

Some justification of this model can be found in a talk given to the D0 data model group.

Here are some requirements for this or any other model.


Model

The model is illustrated in five figures: a data class diagram, a key class diagram, a generator class diagram, an event trace for creating a chunk and an event trace for retrieving a chunk. Physical dependencies are shown within the model and for RECO packages which make use of the event package.


Classes

Event

The event class provides a container for chunks of data. The chunks are organized by type: the different types are loosely coupled and typical event reconstruction consists of generating one chunk for each type. Multiple versions of a chunk may be stored to support the evolution of algorithms. The different chunks within a type must be assigned unique names. One can be specified as the default.

The event contains pointers to abstract chunk objects and not to concrete subclass objects. Thus the event has no dependency on the type of data being stored. An existing chunk can be fetched as a generic chunk with event.get_chunk(chptr) or in a type-safe manner with key.get_chunk(event). The argument in the former is constructed from the type and chunk names.

The event provides a method for inserting new chunks. The user supplies keys which are used to resolve, check and assign the parent chunks at the time of insertion.
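The container behavior described above can be sketched as follows. This is only a minimal illustration, not the actual event package: the method names insert_chunk, set_default and get_default_chunk are our assumptions, get_chunk here takes the type and chunk names directly instead of an argument built from them, and the resolution of parent chunks through keys is left out.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Abstract base: the event sees only this interface.
class Chunk {
public:
    virtual ~Chunk() = default;
    virtual std::string get_type() const = 0;  // e.g. "track" or "jet"
};

// Hypothetical concrete chunk, used only for this illustration.
class TrackChunk : public Chunk {
public:
    std::string get_type() const override { return "track"; }
};

class Event {
public:
    // Insert a chunk under a name that must be unique within its type.
    // The first chunk inserted for a type becomes that type's default.
    void insert_chunk(const std::string& name, std::unique_ptr<Chunk> chunk) {
        assert(chunk != nullptr);
        const std::string type = chunk->get_type();
        auto& versions = chunks_[type];
        if (versions.count(name) != 0)
            throw std::runtime_error("duplicate chunk name: " + type + "/" + name);
        versions[name] = std::move(chunk);
        if (defaults_.count(type) == 0) defaults_[type] = name;
    }

    // Choose which of the stored versions is the default for a type.
    void set_default(const std::string& type, const std::string& name) {
        if (get_chunk(type, name) == nullptr)
            throw std::runtime_error("no such chunk: " + type + "/" + name);
        defaults_[type] = name;
    }

    // Generic (not type-safe) lookup; returns nullptr when absent.
    const Chunk* get_chunk(const std::string& type, const std::string& name) const {
        auto t = chunks_.find(type);
        if (t == chunks_.end()) return nullptr;
        auto c = t->second.find(name);
        return c == t->second.end() ? nullptr : c->second.get();
    }

    const Chunk* get_default_chunk(const std::string& type) const {
        auto d = defaults_.find(type);
        return d == defaults_.end() ? nullptr : get_chunk(type, d->second);
    }

private:
    // type name -> (chunk name -> chunk); only abstract pointers are held.
    std::map<std::string, std::map<std::string, std::unique_ptr<Chunk>>> chunks_;
    std::map<std::string, std::string> defaults_;
};
```

Because the event stores only pointers to the abstract Chunk, nothing in this sketch of the Event depends on TrackChunk or any other concrete data class.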

Chunk

Each chunk manages its data providing the following services:

Concrete chunk classes are derived from the template abstract class TypeChunk which is in turn derived from the class Chunk. The template argument is the actual data type allowing the template class to provide type-safe access to the data and generator.
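The three-level hierarchy (Chunk, TypeChunk, concrete chunk) might look roughly like the sketch below. The methods has_data, data and set_data, and the Track and TrackList types, are hypothetical names chosen for illustration; the real interface is not specified here.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Base class: everything the event needs, with no data type in sight.
class Chunk {
public:
    virtual ~Chunk() = default;
    virtual std::string get_type() const = 0;
    virtual bool has_data() const = 0;
};

// Class template: the argument is the actual data type, so access is type-safe.
template <class Data>
class TypeChunk : public Chunk {
public:
    bool has_data() const override { return has_data_; }
    const Data& data() const {
        assert(has_data_);  // caller must check has_data() first
        return data_;
    }
    void set_data(Data d) {
        data_ = std::move(d);
        has_data_ = true;
    }
private:
    Data data_{};
    bool has_data_ = false;
};

// Hypothetical data type and concrete chunk for tracks.
struct Track { double pt; };
using TrackList = std::vector<Track>;

class TrackChunk : public TypeChunk<TrackList> {
public:
    std::string get_type() const override { return "track"; }
};
```

A user holding a TrackChunk gets a TrackList back from data() without any cast, while the event can still treat the same object as a plain Chunk.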

Generator

A generator specifies the algorithm for constructing the data including all relevant parameters. As for the chunk, concrete generators are derived from a template TypeGenerator which is derived from the base Generator. The template defines the interface for data generation with a type-safe return. The input to the generator is the list of parent chunks.

The association between chunk and generator is indirect. The chunk uses the static generator manager to fetch the concrete generator. Strong typing is lost but the method get_type() is used at run time to verify consistency. The advantage is that the coupling between the concrete chunk and generator is broken. It is possible to have an executable which fetches and manipulates a chunk and its data without linking in the code for generation of that data.
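The generator hierarchy and the run-time check with get_type() can be sketched as below. Only the names Generator, TypeGenerator and get_type() come from the text; the generate signature, the helper typed_generator and the concrete HitCountGenerator are assumptions made for this example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

class Chunk;  // abstract chunk base, defined elsewhere; only pointers are used here

// Base generator: knows the name of the data type it produces, nothing more.
class Generator {
public:
    virtual ~Generator() = default;
    virtual std::string get_type() const = 0;
};

// Class template: adds the data generation interface with a type-safe return.
template <class Data>
class TypeGenerator : public Generator {
public:
    // The input is the list of parent chunks.
    virtual Data generate(const std::vector<const Chunk*>& parents) const = 0;
};

// Recover the typed interface from a base reference. In this sketch the
// get_type() string is the only guard, so type names must be unique per Data.
template <class Data>
const TypeGenerator<Data>& typed_generator(const Generator& g,
                                           const std::string& expected_type) {
    if (g.get_type() != expected_type)
        throw std::runtime_error("generator type mismatch: " + g.get_type());
    return static_cast<const TypeGenerator<Data>&>(g);
}

// Hypothetical concrete generator with one algorithm parameter.
class HitCountGenerator : public TypeGenerator<int> {
public:
    explicit HitCountGenerator(int scale) : scale_(scale) { assert(scale > 0); }
    std::string get_type() const override { return "hit_count"; }
    int generate(const std::vector<const Chunk*>& parents) const override {
        return scale_ * static_cast<int>(parents.size());
    }
private:
    int scale_;
};
```

Code that only reads a chunk never calls typed_generator and so never needs the concrete generator class at link time.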

Key

A key is used to specify a chunk within an event. Again an abstract template class TypeKey is derived from an abstract base Key. The event only makes use of the base. The template argument is the specific chunk type (not data type as for the chunk and generator). This allows it to provide type-safe access to the data.

Different types of keys may be derived from the TypeKey. These are templates with the same argument. We immediately identify two or three within the event category. The concrete NameKey retrieves a chunk specified by name. The abstract TestKey loops over chunks and returns the one which best satisfies a test method. Users derive from this class to implement this method. Finally, for those who like to construct generators from RCP (parameter) files, there could be an RCPKey which selects a chunk with a matching set of parameters.
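A sketch of the key classes follows. The Event here is a stripped-down stand-in holding named chunks only, the use of dynamic_cast for the type check is our assumption, and we assume that "best satisfies" means the highest value returned by a hypothetical method test(). TrackChunk and MostTracksKey are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

class Chunk {
public:
    virtual ~Chunk() = default;
};

// Minimal stand-in for the event: named chunks only.
class Event {
public:
    void insert_chunk(const std::string& name, std::unique_ptr<Chunk> c) {
        assert(c != nullptr);
        chunks_[name] = std::move(c);
    }
    const std::map<std::string, std::unique_ptr<Chunk>>& chunks() const { return chunks_; }
private:
    std::map<std::string, std::unique_ptr<Chunk>> chunks_;
};

// Base key: the only key interface the event itself would use.
class Key {
public:
    virtual ~Key() = default;
};

// Typed key: the template argument is the concrete chunk type.
template <class ChunkT>
class TypeKey : public Key {
public:
    virtual const ChunkT* get_chunk(const Event& event) const = 0;
};

// Concrete key: select a chunk by name.
template <class ChunkT>
class NameKey : public TypeKey<ChunkT> {
public:
    explicit NameKey(std::string name) : name_(std::move(name)) {}
    const ChunkT* get_chunk(const Event& event) const override {
        auto it = event.chunks().find(name_);
        if (it == event.chunks().end()) return nullptr;
        return dynamic_cast<const ChunkT*>(it->second.get());
    }
private:
    std::string name_;
};

// Abstract key: loop over chunks and return the one with the highest score.
template <class ChunkT>
class TestKey : public TypeKey<ChunkT> {
public:
    virtual double test(const ChunkT& chunk) const = 0;  // supplied by the user
    const ChunkT* get_chunk(const Event& event) const override {
        const ChunkT* best = nullptr;
        double best_score = 0.0;
        for (const auto& entry : event.chunks()) {
            const ChunkT* c = dynamic_cast<const ChunkT*>(entry.second.get());
            if (c == nullptr) continue;  // chunk of another type
            const double score = test(*c);
            if (best == nullptr || score > best_score) { best = c; best_score = score; }
        }
        return best;
    }
};

// Hypothetical chunk and user-defined test key.
struct TrackChunk : Chunk {
    explicit TrackChunk(int n) : n_tracks(n) {}
    int n_tracks;
};
struct MostTracksKey : TestKey<TrackChunk> {
    double test(const TrackChunk& c) const override { return c.n_tracks; }
};
```

The same key object can be reused for many events, since it holds only the selection criterion and no reference to a particular event.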

The template TypeKey also provides a method for promoting chunks, i.e. converting a persistent pointer into a real pointer. The details of this conversion depend on the persistency mechanism but this placement of the conversion allows for minimal coupling.

GeneratorManager

The generator manager has the responsibility of managing the list of known generators. These are constructed by the user and assigned a name when they are registered with the manager. The pointers from chunks to generators are implemented with these names. This eliminates the link-time coupling between chunks and generators. Thus it is possible to construct a program which makes use of a chunk without linking in the code for its generator.
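The name-based association might be implemented along the following lines. This is a sketch under our own assumptions: the methods register_generator and find, the simplified Chunk holding a generator name, and the concrete TrackFitGenerator are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

class Generator {
public:
    virtual ~Generator() = default;
    virtual std::string get_type() const = 0;
};

// Static registry mapping user-assigned names to generators.
class GeneratorManager {
public:
    // Returns false and leaves the registry unchanged if the name is taken.
    static bool register_generator(const std::string& name, std::unique_ptr<Generator> g) {
        assert(g != nullptr);
        return registry().emplace(name, std::move(g)).second;
    }
    // Returns nullptr when no generator of that name has been registered,
    // e.g. in an executable that never links the generator code.
    static const Generator* find(const std::string& name) {
        auto it = registry().find(name);
        return it == registry().end() ? nullptr : it->second.get();
    }
private:
    static std::map<std::string, std::unique_ptr<Generator>>& registry() {
        static std::map<std::string, std::unique_ptr<Generator>> r;
        return r;
    }
};

// A chunk holds only the generator's name, never a pointer to a concrete class.
class Chunk {
public:
    explicit Chunk(std::string generator_name)
        : generator_name_(std::move(generator_name)) {}
    const Generator* generator() const { return GeneratorManager::find(generator_name_); }
    bool can_generate() const { return generator() != nullptr; }
private:
    std::string generator_name_;
};

// Hypothetical concrete generator; it would live in a reconstruction package.
class TrackFitGenerator : public Generator {
public:
    std::string get_type() const override { return "track"; }
};
```

In a program that does not register TrackFitGenerator, the chunk still works as a data container; it simply reports that it cannot generate.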


Interface example

The above classes are all part of the event class category. The interface between that category and any of the data categories is defined by the abstract chunk, generator and key classes. Each data category must provide appropriate concrete subclasses for each of these.

To clarify the above ideas, we show a model for the interface for hits in the D0 scifi detector. Here is a class diagram. The data consists of a series of layers each containing a list of hit objects. The chunk provides methods for accessing the list of layer names and the list of hits associated with each. The concrete generator contains a hit algorithm for each layer.


Dynamic behavior

The discussion above describes the responsibilities of the classes (event, chunk and key) but not their dynamic behavior. Here we describe how they interact with one another. We divide these interactions into four categories: defining events, generating data, persistent allocation and analysis.

Defining events

Typically an event object is created each time the data acquisition system assembles the appropriate collection of detector data. This data is organized into a raw data chunk, an event is created and the chunk is stored in the event. The event is then defined by adding a series of predefined chunks. This list of chunks might depend on the type of trigger used to select the event. An event trace diagram shows how a chunk is created and assigned to an event.

Generating data

If all chunks are enabled for automatic data generation, then any type of data (as defined by the chunks) may now be fetched. However, there are many chunks for which generation is rather time-consuming and it may be desirable to ensure that a collection of events is reconstructed in a consistent manner. For these reasons, a production computing farm is used to generate data which is then kept in persistent storage.

The data for a particular chunk is generated by constructing a key identifying the chunk (usually the same key is used for many events), using that key to fetch the chunk from the event and then asking the chunk to generate its data. The second event trace diagram shows how a key is used to select a chunk.

Persistent allocation

After the data has been reconstructed, the volume of data is too large to fit in an affordable system of disks, tape robots or even tape warehouses. Instead, increasingly large fractions of data are maintained in each and a substantial fraction is discarded. This event model provides many features to facilitate this allocation:

  • The division of data into chunks provides clear boundaries.

  • The chunk can manage the persistent storage. It can have flags to indicate whether and how data should persist.

  • Data may be safely dropped because regeneration is automatic.

  • Specialized disk-resident chunks may be created to cache a subset of the data from other chunks and the parent chunks may be pushed to tape. These are the equivalents of DSTs and ntuples but have the advantages of being chunks:
    • same access as any other data
    • complete history of generation including algorithm and parent chunks.
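The second and third features in the list above (persistence flags and safe dropping with automatic regeneration) can be illustrated with a small sketch. The Persist enumeration, the methods drop, policy and n_generated, and the use of std::function in place of a real generator are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical flag saying whether and how a chunk's data should persist.
enum class Persist { None, Disk, Tape };

template <class Data>
class TypeChunk {
public:
    // A callable stands in for the generator in this sketch.
    TypeChunk(std::function<Data()> generate, Persist policy)
        : generate_(std::move(generate)), policy_(policy) {
        assert(generate_);  // a chunk without a generator could not regenerate
    }

    Persist policy() const { return policy_; }
    bool has_data() const { return has_data_; }

    // Drop the in-memory data; safe because data() can rebuild it.
    void drop() {
        data_ = Data{};
        has_data_ = false;
    }

    // Return the data, regenerating it first if it is absent.
    const Data& data() {
        if (!has_data_) {
            data_ = generate_();
            has_data_ = true;
            ++n_generated_;
        }
        return data_;
    }

    // Number of times the generator has actually run.
    int n_generated() const { return n_generated_; }

private:
    std::function<Data()> generate_;
    Persist policy_;
    Data data_{};
    bool has_data_ = false;
    int n_generated_ = 0;
};
```

A storage manager could inspect policy() to decide where each chunk goes and call drop() on the rest, while users of data() would see no difference apart from the regeneration time.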

Analysis

The key feature for analysis is rapid access to the appropriate data, which is facilitated with the data allocation described above. Chunks are fetched with keys and the data is extracted from the chunk as prescribed by its interface. For late-stage analysis, this chunk might be an ntuple object or be capable of producing a ZEBRA-based ntuple.

Iteration

Of course, these interactions do not always occur as ordered above. The process is iterative: there will be new types of data and new algorithms to generate existing types. The division of the event into pieces with well-defined interdependencies and history helps to ensure that invalid data is regenerated and valid data is not. There may also be changes in the types of data, i.e. schema evolution. This is also something that could be handled by the chunk.


Physical dependencies

The discussion above has focused on the logical model: identifying classes and their responsibilities and showing how they interact with one another. The event model is the foundation for all of reconstruction and we want to ensure that the event package and especially the overall reconstruction are understandable, maintainable and testable. For these reasons we look at the physical couplings (compile and link dependencies). Our major goal is to avoid cyclic dependencies, especially those which couple a large part of our system.

The above requirements hold for both packages (e.g. the event classes or the software used for one kind of data) and for individual classes. By design the event model has no dependencies on any of the packages that will be used to access or generate data in the event.

The first figure shows the dependencies for classes within the event package. Dependencies introduced by allowing Chunks to construct their own data are shown with dashed lines. Other than these, we see that there are no cyclic dependencies except between the event and key classes. The key must make use of the event to extract chunks. The event uses keys when inserting new chunks. We have chosen to keep this small cycle rather than to complicate the model by splitting the key class to try to break the cycle.

The cycles introduced by allowing chunks to generate their data would likely be difficult or impossible to remove without giving up this capability. Note however that this cyclic behavior is confined to the event package and is not reflected in the derived classes which appear in the reconstruction packages.

The second figure shows the dependencies for a reconstruction package whose data is built from the data in one other package. We see the dependencies are very clean and that no cycles appear. The event classes are not shown in this diagram but there are no dependencies pointing back from that system.


New Ideas

Here are some recent changes and new ideas:

  • The chunk, key and generator classes have been split into base and derived template components. The latter take the actual data (or chunk) classes as template arguments, allowing us to recover much of the strong typing that was given up to break coupling. Many common type-dependent methods have been moved into these template classes, simplifying the programmer's interface to the event data classes.

  • Physical coupling diagrams have been added.


Unresolved issues

Here are some unresolved issues:

  • If we allow the parent chunks to change, then we must introduce the concept of stale data which must be dropped or regenerated. The data for all descendants also becomes stale. It may be useful to allow for both fixed-parent and evolving-parent chunks.

  • We have implicitly assumed that the data in a chunk never changes. Even if data is dropped, we expect to get back the same data with regeneration. However, changes in the code, compiler, operating system, etc. can lead to different results.
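For the first issue above, one possible way to track stale data is for each chunk to know its children and to pass the stale flag down to all descendants. The sketch below is only an illustration of that idea, not a proposed design; the methods add_parent, mark_stale and is_stale are hypothetical, and the chunks are assumed to outlive the links between them.

```cpp
#include <cassert>
#include <vector>

class Chunk {
public:
    // Register a parent; the parent records this chunk as a child.
    // Pointers are non-owning, so chunks must outlive these links.
    void add_parent(Chunk& parent) {
        assert(&parent != this);
        parent.children_.push_back(this);
    }

    bool is_stale() const { return stale_; }

    // Mark this chunk and, recursively, all of its descendants as stale.
    void mark_stale() {
        if (stale_) return;  // already marked; avoids repeated visits
        stale_ = true;
        for (Chunk* child : children_) child->mark_stale();
    }

private:
    std::vector<Chunk*> children_;
    bool stale_ = false;
};
```

Marking a chunk stale leaves its parents and siblings untouched, so only the affected branch would need to be dropped or regenerated.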


Here are some comments and questions about the model.


Other relevant links

D0 event data model page


Please direct questions or comments to David Adams (adams@physics.rice.edu).