HDF5 Notes Ghosh

From canSAS
                  canSAS 2D treated data
     
1. The treated data to be stored might include:

a. sequences of 1D data, measured against a scan variable, e.g. temperature
b. one or a sequence of 2D maps, again with possibility of scan variables
   e.g. shear rate, temperature, tomography scan etc.
c. For neutron work the data might be recorded as multiple wavelength
   slices (TOF); for X-rays these might be from an energy dispersive detector
d. the data may be measured with polarised neutrons
e. results from multiple detector configurations.

The common characteristics are the intensity, S(Q,v1,v2,...) or S(Qx,Qy,v1,v2,...),
and the deviations in S, Q, Qx, Qy, v1, etc.

In addition, continuing from the discussions of canSAS 1D XML data, it
is useful to add titles, short titles, and a history of precursor(s)
and treatment (SASprocessnote).

Since the contents have been corrected as far as possible to remove
instrument characteristics there is no need to propagate these items
in this treated data; they remain fully described in the raw data.  It
should thus be possible to merge data from different instruments directly
at this stage (so item c. might even be results from distinct instruments).


2. Storage Formats

a. The three platforms where data will be treated are PC-Windows, Macintosh-OSX,
   and Linux.

b. The primary aim of the canSAS-1D format was to provide a data file which
   could be automatically read by existing well known programs.  The XML
   design allows the files to be imported easily into Excel and to be
   displayed clearly by most web browsers.  The stylised ASCII text can
   also be read easily without any tools on all three systems.  Attribute
   fields contained vital information, notably units, for stored values.

c. For potentially large data sets the text readability is much less valuable.

  Navigating large tables of numbers is impracticable.  This was recognised
  early by the initial proponents of NeXus, notably Jon Tischler (APS).
  The format proposed for NeXus was based on HDF, the Hierarchical Data Format
  (1990-).  This stored data in system independent binary,
  in a single file.  For the user this was structured superficially like
  a Unix file structure, with possibilities of creating links (these, for
  example, allow creating a slab of a selected region of data as a
  new entity, but without actually copying the data, simply linking the
  remapping information to the initial data).  The attraction for very
  large image data with multiple parameters is evident.  Consequently
  a number of general data visualisation tools have been developed,
  and are freely available for all three computer platforms.  It was
  not initially designed to have large volumes of single valued metadata.

d. The NeXus project started in 1995 with the aim of standardising storage
   of Neutron and Xray scattering data.  The file format chosen was HDF,
   based on existing tools and generic visualisation software, but only
   using a very limited subset for simplicity.  A hierarchical dictionary of
   class names was created and these were attached to the hierarchical data
   items as attributes.  The datafile components were hence navigable through these
   class names (see appendix 1).  The files are however standard HDF files.

e. In 2003 the increasing limitations (complexity, lack of text methods, etc.)
   of the HDF file format (versions 1 to 4) were finally overcome with a complete
   change of internal structure, creating HDF5.  The existing petabytes of data
   in HDF4 compatible format require continued maintenance of the basic toolset,
   but all new efforts were directed to HDF5.  The NeXus project too had
   led to considerable amounts of data being stored in the HDF4 format,
   and the NeXus programmers have striven to create a library which
   is transparent to the type of data file involved.  This is done at
   a cost of some complexity.

f. For canSAS-2D treated data the need to map the metadata between instruments 
   has been removed; there are few reasons to follow the guidelines for writing
   a NeXus compatible file.  There are also problems in installing software which
   depends on other components, which themselves have dependencies
   (see appendix 2).
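
The link mechanism described in item c can be illustrated with h5py (listed
in section 4).  This is only a minimal sketch: the file name links_demo.h5,
the group names map1 and map2 and the function demo_links are invented for
illustration, and the file is held in memory so that nothing is written to
disk.  A hard link gives a second name to the same stored array; a soft link
stores only a path which is resolved on access.

```python
import h5py
import numpy as np


def demo_links():
    """Show that linked names refer to one stored array (no copy is made)."""
    # The "core" driver with backing_store=False keeps the file in memory only.
    with h5py.File("links_demo.h5", "w", driver="core", backing_store=False) as f:
        first = f.create_group("map1")
        first.create_dataset("Qx", data=np.linspace(-0.01, 0.02, 128, dtype="f4"))

        second = f.create_group("map2")
        # Hard link: a second name for the same object.
        second["Qx"] = first["Qx"]
        # Soft link: only the path is stored, resolved when it is accessed.
        second["Qx_soft"] = h5py.SoftLink("/map1/Qx")

        # Modify the array through its original name ...
        first["Qx"][0] = -1.0
        # ... and read it back through both links.
        via_hard = float(second["Qx"][0])
        via_soft = float(second["Qx_soft"][0])
    return via_hard, via_soft


if __name__ == "__main__":
    print(demo_links())
```

Both returned values reflect the change made through map1/Qx, which is the
behaviour that lets several maps share one set of scales.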



3. Proposal for HDF5 based data format for canSAS-2D

a. The necessary identifiers for treated data were discussed for canSAS-1D.
   The majority of items were optional.
   It is necessary to extend the information concerning the actual mapped
   data for canSAS-2D.  

b. The prescription should be extensible. Space requirements are less important
   than ease of access, dumping sections and browsing.

c. The scattering data could be stored as a sequence of data items.
   It would be useful for the first data component in a group to be a
   2-D plottable array of corrected intensity, or for the plottable array
   simply to be identified with the NeXus attribute Signal=1.  Another
   option might be to interpolate this (simple bi-linear, or more
   sophisticated splines) onto a regular grid linear in x and y.  The plots
   would then be easier to compare and merge.
   Additional data items could include scan variables, short identifier
   titles etc.
   
d. In addition to the intensities there could be additional similar maps
   of intensity deviation, x-deviation, y-deviation, etc.  The notion of 
   links allows the other data items from the first map to be included 
   without needing copies, hence each component appears similar in "shape".

e. The data could be stored additionally as a sequence of tuples of
    (S, S_dev, x, x_dev, y, y_dev, polarisation, etc..)  This would allow
    different measurements to be sorted and merged, for example in TOF
    measurements several wavelength bands could be saved separately.
    
    With the data it is also useful to include the component information
    and treatment information.  An attribute can contain multi-line text easily.


f. Since a new set of classes would have to be created for NeXus anyway
   there seems little point in using/maintaining another package to
   read/write data.  It is easy to decorate the hdf5 file with the
   few attributes expected in a NeXus file.  
   
   A simple structure might have the following layout:
   
   canSAS-2D
     Data
       S(Qx,Qy) (say 128,128,10,2)
         Attributes X="Qx_scale", Y="Qy_scale", Units="1/cm", Scan1="Temperature"
         Scan2="Polarisation", Legend="short_title", Interpolation="None"      
         Process="multi-line analysis summary", Signal=1
       Sdev(Qx,Qy) (128,128,10,2)
       Qx_scale (128)
       Qx_scale_dev (128)
       Qy_scale (128)
       Qy_scale_dev (128)
       Temperature (10)
       Polarisation (2)
     
     Sample
       Main_Title
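
As an illustration of how little code the layout above needs, the following
h5py sketch writes it to an in-memory file.  It is not a reference
implementation: the function names write_cansas2d and build_example, the
dataset name "S" (used in place of "S(Qx,Qy)", as in the test file of
appendix 3), the file name and all array values are assumptions chosen for
the example.

```python
import h5py
import numpy as np


def write_cansas2d(f, s, sdev, qx, qy, temperature, polarisation):
    """Write the proposed layout into an open h5py file object `f`."""
    data = f.create_group("canSAS-2D/Data")

    sig = data.create_dataset("S", data=s)
    # Attribute names follow the layout; the values are placeholders.
    sig.attrs["X"] = "Qx_scale"
    sig.attrs["Y"] = "Qy_scale"
    sig.attrs["Units"] = "1/cm"
    sig.attrs["Scan1"] = "Temperature"
    sig.attrs["Scan2"] = "Polarisation"
    sig.attrs["Legend"] = "short_title"
    sig.attrs["Interpolation"] = "None"
    # A single string attribute can hold multi-line text.
    sig.attrs["Process"] = "first line of summary\nsecond line of summary"
    sig.attrs["Signal"] = 1

    data.create_dataset("Sdev", data=sdev)
    data.create_dataset("Qx_scale", data=qx)
    data.create_dataset("Qx_scale_dev", data=np.zeros_like(qx))
    data.create_dataset("Qy_scale", data=qy)
    data.create_dataset("Qy_scale_dev", data=np.zeros_like(qy))
    data.create_dataset("Temperature", data=temperature)
    data.create_dataset("Polarisation", data=polarisation)

    sample = f.create_group("canSAS-2D/Sample")
    sample.create_dataset("Main_Title", data="placeholder title")


def build_example():
    """Build the layout with dummy data and report what was stored."""
    shape = (128, 128, 10, 2)  # Qx, Qy, temperature, polarisation
    with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
        write_cansas2d(
            f,
            s=np.zeros(shape, dtype="f4"),
            sdev=np.zeros(shape, dtype="f4"),
            qx=np.linspace(-0.01, 0.02, 128, dtype="f4"),
            qy=np.linspace(-0.015, 0.015, 128, dtype="f4"),
            temperature=np.linspace(20.0, 65.0, 10, dtype="f4"),
            polarisation=np.array([-1, 1], dtype="i4"),
        )
        sig = f["canSAS-2D/Data/S"]
        return sig.shape, dict(sig.attrs), sorted(f["canSAS-2D/Data"].keys())


if __name__ == "__main__":
    print(build_example())
```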
      


4. Data browsing for HDF files

Browsers can open files and restructure internal components; most importantly
they can not only list the contents of data fields, but usually have several
plotting options for 1D cuts, 2D maps etc.

They include:
HDFView http://www.hdfgroup.org/hdf-java-html/hdfview/
HDFExplorer (semi-commercial product $39)
ISAW (?)
PyMca   http://pymca.sourceforge.net

HDF5 Data manipulation
h5py http://code.google.com/p/h5py
The HDF5 library includes working interfaces and examples for C, C++,
F77, F90 and Java.
The HDF5 library is incorporated into MATLAB, IDL etc.

The NeXus library is built on top of HDF5 and has support for Python
and C.  The Fortran 90 interface works only on Windows.
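
For scripted access the h5py package listed above may be sufficient on its
own.  The sketch below walks a file and lists its groups and datasets, in
the spirit of "h5dump -n".  The helper names list_contents and demo_listing,
the file name and the dummy values are assumptions made for the example; the
group and dataset names are those of the test file in appendix 3.

```python
import h5py
import numpy as np


def list_contents(f):
    """Return (kind, path) pairs for every group and dataset below the root."""
    found = []

    def visitor(name, obj):
        kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
        found.append((kind, "/" + name))
        # Returning None tells h5py to continue the walk.

    f.visititems(visitor)
    return found


def demo_listing():
    """Build a small in-memory file and list it."""
    with h5py.File("browse_demo.h5", "w", driver="core", backing_store=False) as f:
        data = f.create_group("canSAS2D/Data")
        qx = data.create_dataset("Qx", data=np.zeros(128, dtype="f4"))
        qx.attrs["Units"] = "1/A"
        data.create_dataset("S", data=np.zeros((128, 128), dtype="f4"))
        f.create_group("canSAS2D/ASample")

        listing = list_contents(f)
        units = f["/canSAS2D/Data/Qx"].attrs.get("Units")
    return listing, units


if __name__ == "__main__":
    listing, units = demo_listing()
    for kind, path in listing:
        print(kind.ljust(8), path)
    print("Qx units:", units)
```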


Examples

(One weakness of the NeXus project was the paucity of examples, hence
the divergence in local implementations.)

Typical example files might include 
 - simple SAS from radially symmetric data to check interpolation procedures
    etc.
 - GSAS having a pattern offset from nominal detector centre
 
 etc. 


Appendix 1  Annotated summary of NeXus raw data file

Structure of Nexus HDF5 file from HDFView

top level 054289.nxs
   group size 1
   4 attributes
   HDF5_Version = 1.8.3
   NeXus_Version = 4.2.0
   file_name = /users/data/054289.nxs
   file_time = 2012-04-19T11:12:34+01:00
   
   entry0
    Group size = 14
    Number of attributes = 1
      NX_class = NXentry			! standard starting point for
      						! NeXus files
      
      D22                                  	! instrument name
      Group size = 14
      Number of attributes = 1
        NX_class = NXinstrument			!instrument components & values 
	
	BS
	Group size = 12
	Number of attributes = 1
	NX_class = NXbeamstop
	  bx_actual
	  32-bit floating-point
	  Number of attributes = 0
	    Value = 1.81               ! (no units!)
	    
	  bx_offset  etc
	Detector
	Group size = .....  
	
	etc. for each attenuator, collimation, selector ...  
     data
     Group size = 1
     Number of attributes = 1
       NX_class = NXdata
       data
       32-bit integer, 128 x 128 x 1
       Number of attributes = 1
         signal = 1				! shows plottable data entity
	   value	   table[,,]
	   
     sample
     Group size = 20
     Number of attributes = 1
        NX_class = NXsample
	  temperature			
	  32-bit floating-point
	  Number of attributes = 0
	  san_actual 				! sample rotation angle
	  32-bit floating-point
	  Number of attributes = 0


Appendix 2 installing and running programs on different systems using gcc

The dependencies for hdf based software are illustrated below.
 
Macintosh OSX, Leopard 10.5.8 (2008) with Xcode, gfortran, gcc 4.2.3
Linux: Fedora Core-11 (2009) gfortran, gcc 4.4

HDFView 2.7   for pre-2010 systems, v2.8 current
HDF5 version 1.8.5 for pre-2010 systems..currently 1.8.9
     h5py 1.3.1.tar.gz  (requires python 2.5-2.6, HDF5 1.6.5 to 1.8.5)
      requires HDF5 built without Fortran support (needs shared libraries)
     NeXus-4.2.1 
        Notes:   requires mxml package, mxml-2.7.tar.gz
	
Several binary packages for NeXus (.dmg, .rpm) failed on dependencies; all
were finally rebuilt from source.

To build each component requires inspecting the INSTALL information
to create a suitable set of libraries.  The HDF5 package built
easily though one of the checks in the x86_64 linux package stopped
the system (large number test).
     

Windows - installation binaries Windows-XP for MinGW
    zlib-      built by MSYS
    hdf5-1.8.5 built by MSYS -- manually: -lws2_32 added to link library list 
    (a binary distribution exists for Intel compilers)
    python-2.7.3.msi
    numpy-1.6.2.win32-py2.7.exe
    h5py-2.0.1.win32-py2.7.msi
    NeXus-4.2.0.zip requires mxml
        mxml mxml-2.7.tar.gz  hand-built libmxml.a (without MSYS)

There are severe problems in using the most recent versions perhaps
linked to the changes for the x86_64 architecture.  The pre-built
NeXus packages all depend on the hdf4 and hdf5 and mxml libraries.
The last would require installing the MSYS package, or hand-building.

Appendix 3       Summary of test file cd2_050506_001.h5

Data are from monodisperse spheres (A. Rennie) D22, run 50506+

h5dump -n cd2_050506_001.h5

HDF5 "cd2_050506_001.h5" {
FILE_CONTENTS {
 group      /
 group      /canSAS2D
 group      /canSAS2D/ASample
 dataset    /canSAS2D/ASample/Title
 group      /canSAS2D/Data
 dataset    /canSAS2D/Data/Qx
 dataset    /canSAS2D/Data/Qy
 dataset    /canSAS2D/Data/S
 dataset    /canSAS2D/Data/Sdev
 }
}
Showing the structure in more detail with NeXus decorations:

h5dump -A cd2_050506_001.h5

HDF5 "cd2_050506_001.h5" {
GROUP "/" {
   GROUP "canSAS2D" {
      GROUP "ASample" {
         ATTRIBUTE "NXclass" {
            DATATYPE  H5T_STRING {
                  STRSIZE 8;
                  STRPAD H5T_STR_SPACEPAD;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               }
            DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
            DATA {
            (0): "NXsample"
            }
         }
         DATASET "Title" {
            DATATYPE  H5T_STRING {
                  STRSIZE 50;
                  STRPAD H5T_STR_SPACEPAD;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               }
            DATASPACE  SCALAR
         }
      }
      GROUP "Data" {
         DATASET "Qx" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 128 ) / ( 128 ) }
            ATTRIBUTE "Units" {
               DATATYPE  H5T_STRING {
                     STRSIZE 3;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "1/A"
               }
            }
         }
         DATASET "Qy" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 128 ) / ( 128 ) }
            ATTRIBUTE "Units" {
               DATATYPE  H5T_STRING {
                     STRSIZE 3;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "1/A"
               }
            }
         }
         DATASET "S" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 128, 128 ) / ( 128, 128 ) }
            ATTRIBUTE "Interpolation" {
               DATATYPE  H5T_STRING {
                     STRSIZE 4;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "None"
               }
            }
            ATTRIBUTE "NXclass" {
               DATATYPE  H5T_STRING {
                     STRSIZE 6;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "NXdata"
               }
            }
            ATTRIBUTE "Process" {
               DATATYPE  H5T_STRING {
                     STRSIZE 80;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
               DATA {
               (0): "Created by apl8  27-Jun-2012 22:01:09    MASK: m12a.msk                         ",
               (1): " AvA1 0.0000E+00 AsA2 8.2300E-01 XvA3 0.0000E+00 XsA4 8.2300E-02 XfA5 0.0000E+00",
               (2): "S... 50506  0  6.80E+02 sple A 0.4%     Sbak 50505  0  6.79E+02 MT cell         ",
               (3): "Cd/E 50510  0  3.40E+02 blocked beam                                            ",
               (4): "                                                                                "
               }
            }
            ATTRIBUTE "Signal" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): 1
               }
            }
            ATTRIBUTE "Units" {
               DATATYPE  H5T_STRING {
                     STRSIZE 4;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "1/cm"
               }
            }
            ATTRIBUTE "x_scale" {
               DATATYPE  H5T_STRING {
                     STRSIZE 2;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "Qx"
               }
            }
            ATTRIBUTE "y_scale" {
               DATATYPE  H5T_STRING {
                     STRSIZE 2;
                     STRPAD H5T_STR_SPACEPAD;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): "Qy"
               }
            }
         }
         DATASET "Sdev" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 128, 128 ) / ( 128, 128 ) }
         }
      }
   }
}
}

python h5toText.py cd2_050506_001.h5 
cd2_050506_001.h5
  canSAS2D
    ASample
      @NXclass = ['NXsample']
      Title:char[50][] = __array
        __array = 
    Data
      Qx:float32[128] = __array
        @Units = ['1/A']
        __array = [-0.0093729430809617043, -0.0091350506991147995, -0.0088971592485904694, '...', 0.020839333534240723]
      Qy:float32[128] = __array
        @Units = ['1/A']
        __array = [-0.015177506022155285, -0.01493961364030838, -0.01470172218978405, '...', 0.015034771524369717]
      S:float32[128,128] = __array
        @NXclass = ['NXdata']
        @x_scale = ['Qx']
        @y_scale = ['Qy']
        @Interpolation = ['None']
        @Units = ['1/cm']
        @Signal = [1]
        @Process = ['Created by apl8  27-Jun-2012 22:01:09    MASK: m12a.msk'
 ' AvA1 0.0000E+00 AsA2 8.2300E-01 XvA3 0.0000E+00 XsA4 8.2300E-02 XfA5 0.0000E+00'
 'S... 50506  0  6.80E+02 sple A 0.4%     Sbak 50505  0  6.79E+02 MT cell'
 'Cd/E 50510  0  3.40E+02 blocked beam' '']
        __array = [
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0651400014758, '...', 0.0]
            ...
            [0.0, 0.0, 0.0, '...', 0.0]
          ]
      Sdev:float32[128,128] = __array
        __array = [
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0420599989593, '...', 0.0]
            ...
            [0.0, 0.0, 0.0, '...', 0.0]
          ]


The file is easily read and dumped by h5dump,
and may be plotted with HDFView and PyMca.

The file size is 140768 bytes for 128x128 data and errors.
The ASCII original data are 370694 bytes.
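
The quoted sizes can be checked against the dataset shapes shown in the
h5dump -A listing.  The short calculation below is a sketch of that
arithmetic only; it counts the five datasets and treats everything else
(attributes and internal HDF5 structure) as overhead, which is an assumption
about how to split the total.

```python
# Shapes and types taken from the h5dump -A listing of cd2_050506_001.h5.
FLOAT32_BYTES = 4

datasets = {
    "S": 128 * 128 * FLOAT32_BYTES,
    "Sdev": 128 * 128 * FLOAT32_BYTES,
    "Qx": 128 * FLOAT32_BYTES,
    "Qy": 128 * FLOAT32_BYTES,
    "Title": 50,  # one fixed-length string of 50 characters
}

payload_bytes = sum(datasets.values())

file_bytes = 140768   # HDF5 file size quoted in the text
ascii_bytes = 370694  # ASCII original size quoted in the text

# Whatever is not dataset payload: attributes plus file structure.
overhead_bytes = file_bytes - payload_bytes
ratio = ascii_bytes / file_bytes

print(payload_bytes, overhead_bytes, round(ratio, 2))
```

On these figures the stored values account for 132146 bytes, leaving 8622
bytes for attributes and structure, and the ASCII original is roughly 2.6
times the size of the HDF5 file.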

The example data file may be obtained from

ftp://ftp.ill.fr/pub/cs/reg/canSAS2D/cd2_050506_001.h5


Ron Ghosh 10:51, 2 July 2012 (CDT)


We have had quite a serious issue with HDF5: it often has problems when you are trying to read and write on NFS mounted network drives. For example, your instrument acquisition is trying to write a (read only) datafile to a network drive while you are trying to access it at the same time. Sometimes this leads to data corruption in the HDF5 file. This may not immediately be an issue for reduced datasets, but it is still a deficiency of the hdf5 approach. I do like the h5py approach to accessing hdf5 files. However, at the end of the day hdf5 is binary. And it is hierarchical: you simply cannot change much about the file format once it is written. Plus, binary is simply a step too far for me. It takes me away from the data and I don't want that. I remember we had a big discussion at the last canSAS meeting about the iron rule that it had to be readable in Excel. I'm pretty sure you can't read HDF5 from Excel. Any format that is decided on simply has to be able to cope with 1D, 2D and all the other bells and whistles. I don't think that HDF5 gives the user the easy access they need to the data, nor does it give any flexibility for future changes.

Andyfaff 11:51, 27 July 2012 (CDT)