Email ARJN ARR
Editors' note: Below is an email exchange between Andrew Nelson and Adrian Rennie reproduced here (with light editing) with permission from both. AJJ
The emails reference several files which can be found here:
Email to Adrian:

We currently use the "NeXus" format for raw data files. The format uses HDF files for storage. This is not so bad for a raw file, but is not suitable for reduced datasets. I strongly argue that ASCII-based files should be used for reduced datasets. As soon as you move towards binary files you remove the ability of a scientist to interact with the data in their own way.

I strongly believe that the previous canSAS format was not developed enough. We should've dealt with multidimensional data and kinetic measurements 5 years ago. Certainly any new format should deal with:

1) Multidimensional data (1D, 2D, ..., nD).
2) Kinetic measurements (lots of measurements in a single file).
3) Polarised spin channels (all 4 of them).

I attach a couple of files for your perusal. These are the 1D and 2D (offspecular) versions that I currently use for reflectometry. I did try to get other facilities interested, but it seems that reflectometrists are not really interested in working together on this kind of thing. (Open the files in Chrome or Firefox.)

Things to notice (offspec file):

1) The REFdata node says that the rank is 2. This means you have two primary axes associated with an <R> node. The axes are listed in the axes attribute. You could have as many axes as you wanted.
2) The type attribute says that it's point data, not histogram.
3) The spin attribute says that it's unpolarised.
4) The dimensions attribute shows the length of each of the axes, so that you know how to distribute the reflectivity values between the two axes. This is missing from the current 2D proposal file.
5) The <R> node has an uncertainty attribute that specifies which node (at the same level) contains the uncertainty for the <R> data.
6) Each of the axes, <Qz> and <Qy> in this case, also has an uncertainty node, which specifies the associated uncertainties on the Q values.

I think points 1) and 6) are vital, and allow for a degree of flexibility between facilities.
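To illustrate points 1) to 6), here is a minimal sketch of how a generic reader might use the rank, axes, dimensions and uncertainty attributes. The attached files are not reproduced here, so everything about the example document is an assumption: the root element name, the exact attribute values, the ":" separator, the whitespace-separated numbers, the row-major ordering, and the uncertainty node names dR, dQz and dQy are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical document: node names, attribute spellings, separators and
# number layout are guesses made for illustration only.
EXAMPLE = """<REFroot>
  <REFdata rank="2" axes="Qz:Qy" dimensions="2:3" type="POINT" spin="UNPOLARISED">
    <R uncertainty="dR">1.0 0.9 0.8 0.5 0.4 0.3</R>
    <dR>0.1 0.1 0.1 0.05 0.05 0.05</dR>
    <Qz uncertainty="dQz">0.01 0.02</Qz>
    <dQz>0.001 0.001</dQz>
    <Qy uncertainty="dQy">-0.001 0.0 0.001</Qy>
    <dQy>0.0001 0.0001 0.0001</dQy>
  </REFdata>
</REFroot>"""


def floats(node):
    """Parse whitespace-separated numbers from a node's text."""
    return [float(token) for token in node.text.split()]


def with_uncertainty(parent, name):
    """Return (values, uncertainties) for a child node.

    The uncertainty node is found through the child's 'uncertainty'
    attribute, which names a sibling node; it may be absent.
    """
    node = parent.find(name)
    partner = node.get("uncertainty")
    errors = floats(parent.find(partner)) if partner else None
    return floats(node), errors


def reshape(flat, dims):
    """Nest a flat list into len(dims) levels, assuming row-major order."""
    if len(dims) == 1:
        return list(flat)
    step = len(flat) // dims[0]
    return [reshape(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]


def read_dataset(refdata):
    """Read one REFdata node without knowing the axis names in advance."""
    axes = refdata.get("axes").split(":")
    dims = [int(n) for n in refdata.get("dimensions").split(":")]
    if not (int(refdata.get("rank")) == len(axes) == len(dims)):
        raise ValueError("rank, axes and dimensions disagree")
    values, errors = with_uncertainty(refdata, "R")
    if len(values) != math.prod(dims):
        raise ValueError("number of R values does not match dimensions")
    return {
        "type": refdata.get("type"),
        "spin": refdata.get("spin"),
        "axes": {name: with_uncertainty(refdata, name) for name in axes},
        "R": reshape(values, dims),
        "dR": reshape(errors, dims) if errors else None,
    }


dataset = read_dataset(ET.fromstring(EXAMPLE).find("REFdata"))
```

The reader never hard-codes Qz or Qy: it takes the axis names from the axes attribute and follows the uncertainty attributes, which is the kind of flexibility between facilities that points 1) and 6) are after.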
Not everyone has <Qdev_parallel>, or <Qdev_perp>. The only expected axis is <I>, and attributes contained in <I> can then point to other stuff which may be present. One then has multiple <Idata> for kinetic measurements (or polarised data). The only trouble may be file size, as some 2D images may be sparse. But I don't think that's a big issue.

The strongest advice I can give is to remove as much instrument-related stuff as possible; the data presented should be instrument agnostic. If this doesn't happen then the job of writing an instrument-independent analysis program is much harder. All corrections should've occurred before this point.

I welcome comments on the file outline that we use.

Adrian's reply:

Thank you for your comments. You make a number of good points - I think that a main idea of the canSAS treated format was that it should allow interpretation of reduced data without needing further software to deal again with instrument-specific data reduction. It was designed to be extensible so as to allow people to add further fields, for example fields that contain necessary sample environment conditions, without disrupting the ability to read Q, I(Q), errors and resolution in a general way.

I think that the small-angle scattering community has many issues in common with reflection/surface scattering and also needs to think about ways to describe polarisation etc. For example, depending on the hardware, this may also need to include the efficiency of the analyser, which may change during measurements for a 3He cell. There is some interest in both reflection and SANS in more general descriptions of spin for reflection (full vector description of incoming and outgoing beams) and future proofing might include possibilities for this in data files.

If some people want to retain instrument configurations and reduction history that is not really a difficulty, as the files are unlikely to become large by modern standards just because of these extra fields.
Of course the extra information is not 'required', although what is needed is really determined by the next step - the analysis software. As some variable that is important in one experiment may be undefined in another, the flexibility within a standard is valuable.

Have you developed any schema or style sheets for your xml files? It would be really useful to have the descriptions and definitions as well as example files.

As a brief further comment on data files, it is possible that there will be requirements for information about Q resolution in several directions even if the data is a one-dimensional array. Data sets of two or more dimensions may also need to have more dimensions of resolution. If the data consists of sequences that are recorded for different times or for different positions in samples, there will need to be the possibility to describe resolution or uncertainty in these quantities.

I understand that the X-ray community may be more concerned with very large data sets, for example tomographic scans or many time frames of two-dimensional scattering data. This has provided more challenges.

My reply to Adrian:

I think the key points to address are:

1) How do you deal with polarisation? Here you will have to define axes on the instrument, e.g. Qz being along the beam, Qx being up, Qy across. This is so you can define the polarisation directions, and the applied field vector. Such things are necessary. However, I would remain very wary about including things such as time constants of He3 cells. Inevitably such things are instrument specific. It would be really hard for a generic analysis program to handle all the instrument-specific things.

2) Is there a generic way of handling resolution across instrument space? This gets harder in 2D space and it was already hard for 1D data. NIST has different data for pinhole compared to e.g. ISIS. But then there is also the slit smearing for USANS, etc.
This should be thought of from the viewpoint of a scientist writing a generic analysis program (SANSview). How does that person deal with resolution functions from NIST USANS, NIST SANS, ISIS SANS, Diamond/APS SAXS, etc., in a simplified way? Perhaps the various ways of tabulating and dealing with resolution data should be delimited, i.e. if you have a file with pinhole collimation you expect certain arrays (method1). If you have slit-smeared data (method2) you have another set of arrays. You extend the number of methods to cope with all the different instrument designs. Then we have some document that specifies which methods currently exist, and how to deal with the resolution from those instruments.

3) I suspect multiple timeframes with 2D data will have the ability to produce very large file sizes. This is the direction neutrons are heading with listmode data, but this should already be here for X-rays. Here you have the dilemma of having a file that is easily accessible (ASCII), but that is also smallish (compression). One possible approach is to have a zip file as the overall reduced file, with the following structure:

reducedFile.sasz (compressed archive, z indicates compression)
|
|-> reducedFile.xml ("normal" xml output, but the Idata node contains a link to the required Q/I/dQ/dI data)
|
|----> Idata0.txt (contains data for Q, I, dI, dQ)
|----> Idata1.txt
|----> Idata2.txt

Each of these points (and the others that will come) requires a great deal of thought. I'm not convinced that creating something usable is possible in the time available.

In fact, as I've been writing this email I am starting to think more and more about how we are going about designing this format. I was speaking to our resident CIF expert, James Hester, and he is convinced of a need to go away from hierarchical formats towards a relational format.
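The zipped layout suggested in point 3) can be tried out with standard tools. The following is only a sketch under stated assumptions: the member names follow the outline in the email, but the helper functions `write_sasz` and `read_frame`, the `file` attribute used as the link from the Idata node, and the space-separated column layout of the text files are hypothetical choices made for illustration.

```python
import io
import zipfile


def write_sasz(target, xml_text, frames):
    """Write an XML header plus one ASCII table per frame into a zip archive.

    `frames` is a list of tables; each table is a list of (Q, I, dI, dQ) rows.
    """
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("reducedFile.xml", xml_text)
        for index, rows in enumerate(frames):
            # repr() of a float round-trips exactly through float().
            text = "\n".join(" ".join(repr(v) for v in row) for row in rows)
            archive.writestr(f"Idata{index}.txt", text)


def read_frame(source, index):
    """Extract a single frame without unpacking the rest of the archive."""
    with zipfile.ZipFile(source) as archive:
        text = archive.read(f"Idata{index}.txt").decode("ascii")
    return [[float(v) for v in line.split()] for line in text.splitlines()]


# Hypothetical header: the 'file' attribute is an assumed way of linking
# an Idata node to its text file.
header = '<root><Idata file="Idata0.txt"/><Idata file="Idata1.txt"/></root>'
frames = [
    [(0.01, 100.0, 1.0, 0.001), (0.02, 50.0, 0.7, 0.001)],
    [(0.01, 90.0, 0.9, 0.001), (0.02, 45.0, 0.6, 0.001)],
]

buffer = io.BytesIO()  # stands in for reducedFile.sasz on disk
write_sasz(buffer, header, frames)
second_frame = read_frame(buffer, 1)
```

Each member stays plain ASCII once extracted, so a scientist can still open any single frame in a text editor, while the archive as a whole is compressed.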
http://en.wikipedia.org/wiki/Relational_database
http://en.wikipedia.org/wiki/Hierarchical_database_model

The trouble with the way that we are designing this format is that there are always going to be requirements that we have not foreseen. With a hierarchical format, such as we have at the moment, the design is relatively inflexible; it has to have a certain structure determined by e.g. the XML schema. As soon as you are dealing with polarised, time-dependent, 2D images, with resolution functions ..., you start tying yourself in knots. This problem is worse if you go down the NeXus route; those HDF formats are written in stone with little flexibility. The relational model says you just stuff information into the file. If your file reader can deal with it, fine. If it can't, it just ignores it.

James Hester submitted an abstract for SAS2012 on CIF data formats. He has a huge amount of experience in this field.

I have no schema for the xml file format that I currently use. I am the only one who writes data (and therefore reads) with this format. The format should be relatively easy to understand; I outlined its construction in my previous email. The only thing that the offspec file didn't include was resolution calculations for Qz and Qx. dQz is easy to calculate. dQx is a little harder.