Executive Summary Mike Isely / Mark Fischler V1.1 95/4/14 CAP Design Concepts for High Performance Analysis of Very Large Persistent Object Structures The CAP project aims at improving the tools available for analysis of large physics data sets. A particular tool provides the ability to query large collections of data, applying complex criteria and extracting detailed data. An early implementation of this query software is implemented on a testbed IBM SP-2 system (a cluster of AIX POWER2 nodes with an IBM-proprietary high speed switch interconnect). From creating and using this implementation, we have learned quite a bit about the nature of queries and where performance bottlenecks exist. That experience is being used to write an entirely new underpinning for the query software, to be called POPM ("Physics Object Persistency Manager"). This implements an API dealing with structures and objects which may reside in disk or tape storage, yet can be used as if they resided in memory. A set of notes have been written describing all of the issues behind the new design. Summarized here is the background of the API underpinning, some basic implementation ideas, and major highlights of concepts which permit efficient performance. For the details, see those design notes. Background The central query software currently in use on CAP utilizes the concepts of persistent objects, with a client/server scheme to allow these concepts to be used on data distributed across multiple I/O servers. The API defining the persistent object concepts stems from the API of the "ptool" product created at UIC (University of Illinois at Chicago) as part of the PASS project under Bob Grossman. The original UIC version was designed for use in a single-user non-distributed Unix environment; modifications made at Fermilab permit multi-user access to a data set distributed over a set of storage nodes. Some of the code in the current CAP implementation is essentially a modified version of the "ptool" product. While this was suitable for a proof-of- principle early implementation, the ptool code is in many places not up to the standards required for a robust system, and is in general not tuned for efficient performance of event selection in large HEP datasets. Moreover, the current design has a number of basic deficiencies which prevent achievement of the necessary performance levels. In consequence, Fermilab is creating a new persistent object manager. The API defined by ptool will largely be retained for compatibility, though certain new concepts will be added to address identified deficiencies. The underlying implementation will be completely different, designed with the needs of distributed I/O systems -- and in particular with the need for efficient query processing -- in mind from the start. No code from the UIC ptool implementation will be used in this new Fermilab object manager. Overall Design Persistent data is kept in large collections called "stores", which in turn are divided into several levels, the smallest of which is a fixed-length "segment" of data. A query process accesses this data through a 64 bit "persistent pointer". The query system comprises compute nodes and I/O nodes, interconnected via a high speed switch. On each node is a central shared memory, containing control structures, and an array of "slots": Each slot can hold one segment of persistent data. When a query process accesses data, the needed segment is caused to appear in a slot in the shared memory. The POPM implementation centers on moving segments into and out of slots in an efficient manner. In addition to query processes, the POPM system software has I/O server processes, disk slaves (which perform the physical disk I/O), and shared memory manager processes (which handle local shared memory and clean up wreckage). There is one I/O server and one shared memory manager process per participating CPU node. When a persistent pointer is dereferenced, POPM ensures that a slot containing the required data is set up. When a segment is needed, the local I/O server process picks up the request for data in that slot, and forwards its description to the I/O server of the correct I/O node. A disk slave sees the request and reads the proper data, which the I/O server then forwards to originating I/O server, and thus into the required slot. New features / performance improvements The following features enhance efficiency: (1) I/O server processes can forward multiple requests without waiting wait for completions. (2) The query process may perform read-ahead by setting up multiple slot requests at a time. (3) Multiple disk slaves allow for concurrent disk I/O. A new concept of a "physical pointer" will be introduced. This should significantly reduce CPU overhead when multiple attributes of a single object are needed. The various fields that make up a persistent pointer will be run-time flexible in POPM, by encoding them as part of the store metadata. POPM will explicitly permit multiple persistent "address spaces". This will allow multiple data models and experiments to coexist cleanly in a single query system. Decision logic for read-ahead in POPM will be implemented in the query process. This will result in full overlapped I/O and better switch usage. Within a store, striping across disk units will be on a segment basis. This relatively fine-grain striping, in combination with read-ahead logic, should result in highly parallel disk access, even for a single query process. Particular efforts are made to minimize the CPU cost of a dereferencing a persistent pointer. For example, the searching algorithm to determine whether a slot with the needed data is already valid will employ a hashing scheme, instead of the simpler but slower linear search. And when new segments are brought in, the number of memory copy steps will be kept to an absolute minimum.