This article was kindly scanned, transformed into text, and spellchecked by George Perkins and Jeannette Dubendorf of MSU, with a little postprocessing by Jim Linnemann.

p 308 COMPUTERS IN PHYSICS, VOL 7, NO. 3, MAY/JUN 1993

SCIENTIFIC PROGRAMMING

SOFTWARE FOR PORTABLE SCIENTIFIC DATA MANAGEMENT

Stewart A. Brown, Mike Folk, Gregory Goucher, & Russ Rew

Stewart A. Brown is a physicist at Lawrence Livermore National Laboratory, Livermore, CA 94550; Mike Folk is a computer scientist at the National Center for Supercomputing Applications, Champaign, IL 61820; Gregory Goucher is a computer scientist at NASA Goddard Space Flight Center, Greenbelt, MD 20771; Russ Rew is a computer scientist at Unidata Program, University Corporation for Atmospheric Research, Boulder, CO 80307.

The problem of processing large amounts of scientific data is a multifaceted one, and many different groups are addressing parts of the problem. Observational data are pouring in from various data-gathering networks and instruments. Numerical simulations generate vast amounts of data. As the processing power of computer systems increases, the problems of getting large volumes of data into and out of applications and of visualizing the results have become significant obstacles to gaining insight from the data. A related problem is how to organize data sets with a view to the various kinds of access that applications demand.

Often a diverse range of computer systems is used for processing related data. A numerical simulation may be set up on one workstation, run on a large parallel computer, and the results visualized on a different graphics workstation. Some or all of these computers may have different binary data representations. This presents its own set of problems.

In this article we discuss the general philosophy of portable scientific-data-management systems and compare four of the systems that address these kinds of problems: CDF, HDF, netCDF, and PDB.
Introduction

Traditional methods for handling scientific data include the use of flat sequential files of machine-specific binary data or portable ASCII data. For applications such as visualization, such methods are inefficient in storage, access, and ease-of-use. Relational database systems fail to accommodate the multidimensional or hierarchical structures often found in scientific data sets. In addition, relational systems do not provide adequate performance for the size, complexity, and type of access required for many scientific data sets. Object-oriented database systems may provide solutions to some of these problems in the future, but relevant experience with such systems is still meager.

In the mid 1980s, efforts at several institutions identified and attacked these problems using various approaches. The systems that were developed independently have features in common, as well as differences that reflect the emphases and requirements among the institutions involved. Three of the systems (CDF, HDF, and PDB) developed from independent models of scientific data; a fourth (netCDF) was built on the CDF data model with an independent implementation designed to support different trade-offs. Some consolidation of features is now occurring: for example, HDF now supports the CDF/netCDF data model using the HDF format.

These systems address the need for portability of both data and software, for flexible control of the organization and contents of data files, and for ways to make data self-contained and self-describing. If a data set is self-describing, general-purpose tools can extract not only data values, but also information about the data values (e.g., units) and relationships among components of the data (e.g., whether a set of values represents coordinates for other data).

The systems that we discuss here are implemented as software libraries.
This provides a great deal of flexibility for users: applications can be built using a library's data-access interface to deal with data at a higher level than bits and bytes. Each system also defines an associated format for files that are accessed by calling routines in the associated library.

We wish to emphasize two meanings of portability when discussing these systems. The first meaning refers to the data; the second, to the libraries. Data may be stored in a file using a portable data representation. Such a file can be moved among computers with different architectures, and the data in the file can be accessed without an explicit conversion step. The various libraries achieve this data portability in different ways, but all can handle data format conversions implicitly within the library. Second, the libraries are implemented in a relatively portable way, so that application programs that use a library for data access can be moved from machine to machine by merely recompiling the programs.

Table I. Comparison of four scientific-data-management systems.

Feature                     CDF            HDF            netCDF           PDB
--------------------------  -------------  -------------  ---------------  -----------------
Languages supported         C, F77         C, F77         C, F77, C++      C, F77, SCHEME
Inherent data types         char, short,   byte, short,   byte, char,      char, short,
                            int, float,    long, float,   short, long,     int, long,
                            double,        double         double, float    double, float,
                            string                                         pointer
User-definable types        no             yes            no               yes
Data-conversion method      XDR, native    XDR, native    XDR              PDC
Maximum array dimensions    10             unlimited      32               unlimited
Extended array dimension    yes            yes            yes              no
Hyperset access             yes            yes            yes              yes
User-definable attributes   yes            yes            yes              yes
Attribute types             any            any            any              any
Named dimensions            no             yes            yes              no
Array-index ordering        row, col       row, col       row, col         row, col
Shareability                yes            yes            yes              no
Compression                 no             yes            no               no
Supporting tools            yes            many           ncdump, ncgen,   PDBView, PDBDiff,
                                                          a few others     ULTRA II

To facilitate a comparison among the four systems (see Table I), we use the following list of features:

o Languages supported. The library can be used from these languages.
o Inherent data types supported. This refers to data types (such as integer or double) for which the library has built-in support.
o User-definable data type supported. Does the library permit applications to define and use their own data types (other than arrays, which are supported by all the libraries)?
o Data-conversion method. How is data portability achieved?
o Maximum number of dimensions. This applies to an array managed by the library.
o Extendible array dimension supported. Does the library support at least one dimension that may be extended as data are added to a data set?
o Hyperset access. Does the library support access to subsets of a larger data set, such as cross-sections?
o User-definable attribute. Does the library support a fixed set of attributes that may be attached to the data (e.g., "units"), or are user- and application-definable attributes supported?
o Attribute types. What data types does the implementation allow for attribute information?
o Named dimensions. Does the implementation support the concept of named dimensions?
o Array index ordering. Does the implementation support row-major or column-major array-index ordering (or any other)?
o Shareability. Does the implementation allow one writer and multiple readers to access the same data concurrently?
o Compression. Does the implementation support data compression?
o Supporting tools. Are there tools for browsing files, visualization applications, and data analysis?

For brevity, we have not included some features that are common to all four systems. For example, all the systems support some backward compatibility with older versions of the associated format as the system evolves. Also, all four systems are freely available; information about how to obtain the software appears in the box labeled I/O-Paul Dubois on p. 308.
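
Two of the features listed above, hyperset access and array-index ordering, can be made concrete with a short sketch. The Python below is an illustration only and is not taken from any of the four libraries: the function names strides and read_hyperslab are hypothetical, and the "file" is simply a flat Python list. It shows how a library might locate a cross-section of a multidimensional array inside linear storage, for either row-major or column-major ordering.

```python
from itertools import product

def strides(shape, order="row"):
    """Element strides for a dense array stored in row- or column-major order."""
    n = len(shape)
    s = [1] * n
    if order == "row":            # last index varies fastest (C convention)
        for i in range(n - 2, -1, -1):
            s[i] = s[i + 1] * shape[i + 1]
    else:                         # first index varies fastest (Fortran convention)
        for i in range(1, n):
            s[i] = s[i - 1] * shape[i - 1]
    return s

def read_hyperslab(flat, shape, start, count, order="row"):
    """Return the sub-block that begins at `start` and extends `count` along each axis.

    Only the requested elements are touched; the rest of `flat` is never read.
    """
    st = strides(shape, order)
    ranges = (range(a, a + c) for a, c in zip(start, count))
    return [flat[sum(i * k for i, k in zip(idx, st))] for idx in product(*ranges)]

# A 3 x 4 array in which element (i, j) holds 10*i + j, stored both ways.
shape = (3, 4)
flat_row = [10 * i + j for i in range(3) for j in range(4)]
flat_col = [10 * i + j for j in range(4) for i in range(3)]

column = read_hyperslab(flat_row, shape, start=(0, 2), count=(3, 1))
block = read_hyperslab(flat_row, shape, start=(1, 1), count=(2, 2))
same_column = read_hyperslab(flat_col, shape, start=(0, 2), count=(3, 1), order="col")
```

The caller asks for the same cross-section in both cases; only the stride calculation changes with the storage order, which is the kind of detail a data-access library keeps out of application code.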
CDF

The Common Data Format (CDF) software is a scientific data-management package that was designed and developed at the National Space Science Data Center (NSSDC) at NASA's Goddard Space Flight Center (GSFC). The development of CDF arose out of the recognition by the NSSDC of the need for a class of data models that is matched both to the structure of scientific data and to how such data may be used.

Even though CDF has its own internal self-describing format, it is more than just a data format. It is a library that allows programmers to access and manage multidimensional data in a fashion consistent with its scientific orientation. The irony of the term "Format" in the name of this software is that the actual data format utilized by CDF is of no concern to the user, since it is only accessible through the 23 interface routines that insulate the user from needing to know anything about the format.

The CDF library was designed to provide the essential framework from which generic applications (e.g., visualization, statistical analysis, browsers, etc.) can easily be created. The library allows developers to create applications that permit users to slice data across multidimensional subspaces, access entire structures of data, perform subsampling of data, and access one element independently of its relationship to other elements.

The concept of using an internal data dictionary to describe the contents of a data file is not new for the purpose of achieving a data-independent transportable standard. However, the CDF differs from those earlier formats by being oriented toward the researcher's view of the data. The most important difference between the CDF and conventional data formats is in the self-describing nature of the data descriptions maintained within the CDF and its supporting software. This self-describing property makes it possible to use CDF for data from a wide variety of disciplines.
The CDF library supports two storage models: a multiple-file model, in which one file contains all the metadata and there is one file for each variable, and a single-file model, in which all the data reside in a single file. Each storage model has advantages, depending on how the data will be managed, accessed, or updated.

When generating a CDF file, the programmer may choose either of two encoding schemes: native encoding, in which a machine's native binary representations are used, and network encoding, in which the data are transparently converted from the native format to a standard data format when writing the data and from the standard format to native format when reading the data.

HDF

Hierarchical Data Format (HDF) was created at NCSA to provide users with a file format for sharing scientific data in a heterogeneous computing environment, to provide a set of high-level interfaces to that data, and to support the development of scientific visualization and analysis tools that have a high degree of data-set independence.

HDF provides simple primitive objects out of which more complex objects can be built. Each type of primitive object is identified by a "tag." The basic structure of HDF consists of an index with the tags of the objects in the file, pointers to the data associated with the tags, and the data themselves.

The design of HDF reflects the assumption that we cannot anticipate what types of data objects will be needed in the future, nor can we know how scientists will want to view their data. As new science is done, new types of data objects are needed, and new tags must be created. The HDF library contains programming interfaces designed to provide views of the data that are most natural for users. As we learn more about the way scientists need to view their data, we can add new interfaces that reflect data models consistent with those views.
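
The tag-and-index structure described above for HDF can be pictured with a toy container. The sketch below is not the HDF file format: the byte layout (a count, then fixed-size index entries of tag, offset, and length, then the data), the tag numbers, and the function names are all assumptions made for illustration. What it is meant to show is why an open-ended set of tags is workable: a reader can walk the index, pick out the tags it understands, and skip the rest.

```python
import struct

# One index entry: tag (16-bit), offset (32-bit), length (32-bit), big-endian.
# This layout is an assumption for the sketch, not the real HDF layout.
ENTRY = struct.Struct(">HII")

def write_container(objects):
    """Serialize [(tag, payload_bytes), ...] as: count, index entries, data."""
    offset = 4 + ENTRY.size * len(objects)      # data start after the index
    index, blobs = [], []
    for tag, payload in objects:
        index.append(ENTRY.pack(tag, offset, len(payload)))
        blobs.append(payload)
        offset += len(payload)
    return struct.pack(">I", len(objects)) + b"".join(index) + b"".join(blobs)

def read_index(buf):
    """Return the list of (tag, offset, length) entries without touching the data."""
    (count,) = struct.unpack_from(">I", buf, 0)
    return [ENTRY.unpack_from(buf, 4 + i * ENTRY.size) for i in range(count)]

def read_objects(buf, wanted_tag):
    """Return the payloads of every object carrying `wanted_tag`."""
    return [buf[off:off + length]
            for tag, off, length in read_index(buf) if tag == wanted_tag]

TAG_TEXT, TAG_FLOATS, TAG_FUTURE = 1, 2, 99     # hypothetical tag numbers

buf = write_container([
    (TAG_TEXT, b"temperature in kelvin"),
    (TAG_FLOATS, struct.pack(">3f", 273.0, 280.5, 291.25)),
    (TAG_FUTURE, b"object type this reader does not know"),
])

notes = read_objects(buf, TAG_TEXT)
values = struct.unpack(">3f", read_objects(buf, TAG_FLOATS)[0])
```

A reader written before tag 99 existed still recovers the text and the floats, because it never needs to interpret the payload of an unfamiliar tag.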
HDF supports most common types of data and metadata that scientists use, including multidimensional arrays, raster images, polygonal mesh data, tables, and text. In the future there will probably be a need to incorporate new types of data, such as voice and video.

The HDF library currently supports the following application programming interfaces (APIs) and their corresponding data objects:

o A "general purpose" API for doing basic I/O operations.
o "Raster image set" APIs for accessing raster images.
o A "scientific data set" (SDS) API for accessing multidimensional arrays of the primitive types.
o An "annotations" API for reading and writing textual annotations.
o A "vdata" API for accessing sequences of records in which the fields consist of different primitive types.
o A "Vgroup" API for accessing and managing groups of objects.

There is much overlap between the APIs supported by HDF and those of the netCDF, CDF, and PDB formats. With respect to netCDF this overlap will be total when HDF version 3.3 is released. A project supported by the National Science Foundation will result in the implementation of the netCDF data model within HDF.

netCDF

The netCDF interface was designed to support the creation, access, and sharing of data that are portable, self-describing, directly accessible, and appendable. "Directly accessible" means that a small subset of a large dataset may be accessed efficiently, without first reading through the preceding data. "Appendable" means that new data can be appended efficiently to an existing netCDF file.

The netCDF data model evolved from an early version of the NSSDC CDF data model, adding aggregate and hyperslab access, named dimensions, variable-specific attributes, and conventions for coordinate variables.
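
The data-model features just named (named dimensions, variable-specific attributes, and coordinate variables) can be pictured with a small in-memory sketch. This is not the netCDF library interface: the Dataset and Variable classes and their methods are hypothetical. The sketch also assumes one particular convention for coordinate variables, namely that a coordinate variable is a one-dimensional variable with the same name as the dimension it describes.

```python
from dataclasses import dataclass, field

@dataclass
class Variable:
    dims: tuple                     # names of the dimensions, slowest-varying first
    data: list                      # flat values in row-major order
    attrs: dict = field(default_factory=dict)   # variable-specific attributes

@dataclass
class Dataset:
    dims: dict = field(default_factory=dict)        # dimension name -> length
    variables: dict = field(default_factory=dict)   # variable name -> Variable
    attrs: dict = field(default_factory=dict)       # attributes of the whole data set

    def shape(self, name):
        """Shape of a variable, looked up through its named dimensions."""
        return tuple(self.dims[d] for d in self.variables[name].dims)

    def coordinates(self, name):
        """Coordinate values for each dimension of `name` that has a
        same-named variable (the convention assumed in this sketch)."""
        return {d: self.variables[d].data
                for d in self.variables[name].dims if d in self.variables}

ds = Dataset(dims={"time": 2, "lat": 3}, attrs={"title": "toy example"})
ds.variables["lat"] = Variable(("lat",), [-30.0, 0.0, 30.0],
                               {"units": "degrees_north"})
ds.variables["temp"] = Variable(("time", "lat"),
                                [280.0, 300.0, 281.0, 279.5, 299.0, 280.5],
                                {"units": "K"})
```

Because the dimensions are named and shared, a general-purpose tool can discover that "temp" varies over "lat", find the latitude values, and read the units of both, without any knowledge specific to this data set.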
The netCDF implementation first featured the use of the XDR standard for portable data representation, a single-file implementation, and an ASCII-based language for representing the binary data in a human-readable and editable form.

NetCDF was initially developed to be used within the Unidata Program as an interface between system-level programs that capture broadcast meteorological data and application programs that analyze and display the data. Many other groups have since found netCDF to be useful for sharing data among different architectures, providing a flexible way to access data cross-sections, or providing an extensible interface to data that insulates application programs from the details of a data format.

NetCDF emphasizes a single common interface to data, implemented on top of a platform-independent representation. To achieve data portability, netCDF relies on the XDR standard for external data representation. Use of a vendor-supplied XDR library included with most UNIX systems is recommended, but a portable implementation of XDR (made freely available by Sun Microsystems) comes with the netCDF software distribution for use with operating systems that do not include an XDR library.

For more information about what platforms are supported, what visualization tools can import netCDF data, what utilities are available, how to subscribe to the netCDF mailing list, and answers to other frequently asked questions, use anonymous FTP to get the file pub/netcdf/FAQ from host unidata.ucar.edu.

PDB

PDB began with an attempt to provide the ability to save and restore the kinds of data and structures used in C programs. In fact, the basic form of the API (independent of language bindings) was intended to be as similar as possible to the standard C I/O library interface.
The hope was and is to make it as easy as possible for C programmers to save their data structures, so that they need neither avoid natural data structures for fear of I/O difficulties nor write endless special-purpose I/O routines for each data structure defined. Once this goal was met, support for F77, which has more limited data-structuring capabilities, was easy. The reverse would not have worked: functionality aimed primarily at F77 usage would not provide the flexibility that C programs need.

PDB is implemented in C and is one component of the Portable Application Code Toolkit (PACT) set of tools. PDB relies on no software outside of PACT, but it does derive its portability from other parts of PACT.

Differences

Although they started from almost the same data model, which was developed at NSSDC for an early version of CDF, netCDF and CDF have evolved independently and use different file formats. Other differences are that netCDF does not support native-mode representation or multifile datasets for the efficient addition of new variables, and CDF does not support named dimensions or simple conventions for variable coordinates.

NetCDF data, CDF network-encoded data, and some kinds of HDF data are machine-independent: the form in which such data are stored is the same for any platform on which the library is implemented. PDB data, CDF native-encoded data, and other kinds of HDF data use the native representations for the machines on which the data are first written, converting the data only when necessary to native encodings of other machines.

Each of these two approaches has advantages. Machine-independent data can be shared transparently across network file systems. With machine-independent data, conversion occurs whenever data are accessed, and for some kinds of data (e.g., byte arrays or floating-point numbers on machines that support IEEE floating-point representations), the necessary conversions are trivial.
On the other hand, native representation incurs no conversion cost for any kind of data when accessed on the machine for which the data were written.

PDB by default writes data to files in the native format of the machine on which the software is running. It keeps a parametrization of the binary formats in which the data are written in the file, so that when the file is read, conversions can be done to a different format. PDB also permits the file to be written in the binary format of some target system. In this way, application developers can tailor their systems to the expected usage. For example, data generated on a Cray can be written to a file aimed at a slower PC, so that when the file is accessed on the target PC, no conversions need to be done, and access speed is maximized.

HDF uses a centralized registry of basic data tags. The HDF tags designate fundamental data types needed in application programs, but because these tags are open-ended, unanticipated data types may be added. HDF can identify complex data structures with a small tag, and so it can store such structures compactly. PDB can store arbitrary C structures, including complex user-defined data structures. The other interfaces have somewhat less flexibility, supporting arrays, structures for which a tag has been defined, or simple records composed of a sequence of scalars and arrays of various types. CDF, netCDF, and PDB are somewhat more self-describing than HDF, since it is not necessary to agree upon or register a data-structure tag.

HDF is a single format that supports a number of interfaces; hence there are a large number of routines with simple interfaces in the HDF library. Users generally use very few of these routines. For example, only two HDF routines are generally used for reading and writing raster images. In contrast, CDF, netCDF, and PDB are single interfaces for many different kinds of data.
One of the aims of these interfaces is to have a small "surface area," so that the entire interface is easy to learn and use. Hence there are a smaller number of more general-purpose functions in the CDF, netCDF, and PDB interfaces. In either case, a small fraction of the routines in the interface fulfill a large proportion of the most common needs.

We know of no all-purpose conversion packages that will handle data in all of these forms, but at least two commercial visualization packages can import data from three of the four formats. It is not always possible to represent all the information in a file of one format in one of the other formats, since the systems were designed for somewhat different purposes, and may use different conventions for representing some data relationships. But we believe that all of these packages have advantages over traditional methods for storing and accessing scientific data.

Further reading

CDF

CDF User's Guide, Version 2.3, National Space Science Data Center, NASA Goddard Space Flight Center, Greenbelt, Maryland.

HDF

NCSA HDF Specification Manual. (Available from NCSA or by anonymous FTP from ftp.ncsa.uiuc.edu.)

NCSA HDF Calling Interfaces and Utilities. (Available from NCSA or by anonymous FTP from ftp.ncsa.uiuc.edu.)

netCDF

NetCDF User's Guide, An Interface for Data Access, Version 2.3, February 1993, Unidata Program Center, Boulder, Colorado. (Available by anonymous FTP from unidata.ucar.edu in the file pub/netcdf/guide.ps.Z)

Russell K. Rew and Glenn P. Davis, "NetCDF: An Interface for Scientific Data Access," Comput. Graph. Applications, 76-82, July 1990.

PDB

PACT User's Guide, UCRL-MA-112087, LLNL. (Available from LLNL or by anonymous FTP from phoenix.ocf.llnl.gov in pub/pactdoc<date>.tar.Z)

PDBLib User's Manual, M-270 Rev. 2, LLNL. (Available from LLNL or by anonymous FTP from phoenix.ocf.llnl.gov in pub/pactdoc .tar.Z)

Due to production schedules I have not yet begun to receive your comments, ideas for articles, or information to supplement that given in previous columns, but this is a good place to remind you that all of these are eagerly sought. One of my fellow editors commented to me that editing a column like this is like selling insurance: once you run out of friends and relatives to ask for articles, it gets tough. If you cannot e-mail me at duboisl@llnl.gov, please write to me at L-472, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA.

Here is how to get the portable database software mentioned in this month's article:

(1) The CDF V2.3 distribution is available via anonymous FTP and NSFnet for VMS systems, via anonymous FTP only for UNIX systems, and via anonymous FTP and floppy disks for MS-DOS systems. There is a user support office (USO) for CDF that you should contact when you need the CDF software, programming help, or assistance for any other form:
Phone: (301) 286-9506
Internet: CDFSUPPORT@NSSDCA.GSFC.NASA.GOV (128.183.36.23)
NSInet (SPAN): NSSDCA::CDFSUPPORT

(2) HDF is available on Internet via anonymous FTP (ftp.ncsa.uiuc.edu). Look in directory HDF for latest version of the library. Look at top level for various tools. A mailing list known as hdfnews exists for discussion of the HDF interfaces and announcements about HDF bugs, fixes, and enhancements. To subscribe, send a request to sxu@ncsa.uiuc.edu. Information about HDF documentation can be obtained from the same address.

(3) netCDF is available on Internet via anonymous FTP (host unidata.ucar.edu, file pub/netcdf/netcdf.tar.Z)

(4) PDB is available on Internet via anonymous FTP (host phoenix.ocf.llnl.gov, file pub/pactxx_xx_xx.tar.Z)