Email ARJN ARR

From canSAS

Editors' note: Below is an email exchange between Andrew Nelson and Adrian Rennie reproduced here (with light editing) with permission from both. AJJ

The emails reference several files which can be found here:

c_PLP0000708.xml - 1D reflectometry data example from ANSTO

off_PLP0000708.xml - 2D reflectometry data example from ANSTO


Email to Adrian:

We currently use the "NeXus" format for raw data files. The format uses
HDF files for storage.  This is not so bad for a raw file, but is not
suitable for reduced datasets.  I strongly argue that ASCII-based
files should be used for reduced datasets.  As soon as you move
towards binary files you remove the ability of a scientist to interact
with the data in their own way.

I strongly believe that the previous canSAS format was not developed
enough.  We should've dealt with multidimensional data and kinetic
measurements 5 years ago.

Certainly any new format should deal with:

1) Multidimensional data (1D, 2D, ..., nD)
2) Kinetic measurements (lots of measurements in a single file)
3) Polarised spin channels (all 4 of them)

I attach a couple of files for your perusal. These are the 1D and 2D
(offspecular) versions that I currently use for reflectometry.  I did
try and get other facilities interested, but it seems that
reflectometrists are not really interested in working together on this
kind of thing.
(open the files in Chrome or Firefox)

Things to notice (offspec file):
1) The REFdata node says that the rank is 2. This means you have two
primary axes associated with an <R> node. The axes are named in the
axes attribute. You could have as many axes as you wanted.
2) The type attribute says that it's point data, not histogram data.
3) The spin attribute says that it's unpolarised.
4) The dimensions attribute shows the length of each of the axes, so
that you know how to distribute the reflectivity values between the
two axes. This is missing from the current 2D proposal file.
5) The <R> node has an uncertainty attribute that specifies which node
(at the same level) contains the uncertainty for the <R> data.
6) Each of the axes, <Qz> and <Qy> in this case, also has an
uncertainty node, which specifies the associated uncertainties on the
Q values.
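
A minimal sketch of how a reader might use these attributes to rebuild
the 2D array. The XML below is a hypothetical layout written from the
description above, not a copy of the ANSTO files: the ":" separator,
the attribute values, the node names dR, dQz and dQy, the idea that the
axis nodes reference their uncertainty nodes the same way <R> does, and
the row-major ordering of the <R> values are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical example file; the real format may differ in detail.
EXAMPLE = """
<REFdata rank="2" axes="Qz:Qy" type="POINT" spin="UNPOLARISED" dimensions="2:3">
  <R uncertainty="dR">1.0 0.9 0.8 0.5 0.4 0.3</R>
  <dR>0.1 0.1 0.1 0.05 0.05 0.05</dR>
  <Qz uncertainty="dQz">0.01 0.02</Qz>
  <dQz>0.001 0.002</dQz>
  <Qy uncertainty="dQy">-0.001 0.0 0.001</Qy>
  <dQy>0.0001 0.0001 0.0001</dQy>
</REFdata>
"""


def floats(node):
    """Parse the whitespace-separated numbers held in a node."""
    return [float(value) for value in node.text.split()]


def reshape(flat, dims):
    """Distribute a flat list over nested lists of shape dims (row-major)."""
    size = 1
    for d in dims:
        size *= d
    if size != len(flat):
        raise ValueError(f"{len(flat)} values do not fill shape {dims}")
    if len(dims) == 1:
        return list(flat)
    step = size // dims[0] if dims[0] else 0
    return [reshape(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]


def read_refdata(xml_text):
    """Read one REFdata node into plain Python lists."""
    root = ET.fromstring(xml_text)
    axes = root.get("axes").split(":")
    dims = [int(n) for n in root.get("dimensions").split(":")]
    if len(axes) != int(root.get("rank")) or len(dims) != len(axes):
        raise ValueError("rank, axes and dimensions disagree")

    r_node = root.find("R")
    # The uncertainty attribute names the sibling node holding the errors.
    dr_node = root.find(r_node.get("uncertainty"))
    result = {
        "type": root.get("type"),
        "spin": root.get("spin"),
        "R": reshape(floats(r_node), dims),
        "dR": reshape(floats(dr_node), dims),
        "axes": {},
    }
    for name, length in zip(axes, dims):
        node = root.find(name)
        values = floats(node)
        if len(values) != length:
            raise ValueError(f"axis {name} has {len(values)} points, expected {length}")
        result["axes"][name] = {
            "values": values,
            "uncertainty": floats(root.find(node.get("uncertainty"))),
        }
    return result


data = read_refdata(EXAMPLE)
```

Because the axes are looked up by the names listed in the axes
attribute, the same reader would work for any rank without knowing in
advance which Q components a facility provides.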

I think points 1) and 6) are vital, and allow for a degree of
flexibility between facilities. Not everyone has <Qdev_parallel> or
<Qdev_perp>.  The only expected axis is <I>, and attributes contained
in <I> can then point to other stuff which may be present.

One then has multiple <Idata> for kinetic measurements (or polarised
data).  The only trouble may be file size, some 2D images may be
sparse. But I don't think that's a big issue.

The strongest advice I can give is to remove as much instrument-related
stuff as possible; the data presented should be instrument agnostic.
If this doesn't happen then the job of writing an
instrument-independent analysis program is much harder. All corrections
should've occurred before this point.

I welcome comments on the file outline that we use.

Adrian's reply:

Thank you for your comments.  You make a number of good points. I think
that a main idea of the canSAS treated format was that it should allow interpretation
of reduced data without needing further software to deal again with instrument-specific
data reduction.  It was designed to be extensible, so that people can add
further fields (for example, ones that contain necessary sample environment conditions)
without disrupting the ability to read Q, I(Q), errors and resolution in a general way.

I think that the small-angle scattering community have many issues in common with 
reflection/surface scattering and also need to think about ways to describe polarisation etc.  
For example, depending on the hardware, this may also need to include efficiency 
of the analyser that may change during measurements for a 3He cell.  There is some
interest in both reflection and SANS in more general descriptions of spin for reflection 
(full vector description of incoming and outgoing beams) and future proofing might
include possibilities for this in data files.

If some people want to retain instrument configurations and reduction history, that is
not really a difficulty as the files are unlikely to become large by modern standards just
because of these extra fields.  Of course the extra information is not 'required',
although what is needed is really determined by the next step - the analysis software.
As some variable that is important in one experiment may be undefined in another,
the flexibility within a standard is valuable.

Have you developed any schema or style sheets for your XML files?  It would be really
useful to have the descriptions and definitions as well as example files.  As a brief
further comment on data files, it is possible that there will be requirements for
information about Q resolution in several directions even if the data is a one-dimensional
array.  Data sets with two or more dimensions may also need to have more dimensions of
resolution.  If the data consists of sequences that are recorded for different times or
for different positions in samples, there will need to be the possibility to describe
resolution or uncertainty in these quantities.

I understand that the X-ray community may be more concerned with very large data sets, 
for example tomographic scans or many time frames of two-dimensional scattering data.  
This has provided more challenges.



My reply to Adrian:
I think the key points to address are:

1) How do you deal with polarisation? Here you will have to define
axes on the instrument, e.g. Qz being along the beam, Qx being up, Qy
across. This is so you can define the polarisation directions, and the
applied field vector. Such things are necessary.
However, I would remain very wary about including things such as time
constants of He3 cells. Inevitably such things are instrument
specific. It would be really hard for a generic analysis program to
handle all the instrument specific things.

2) Is there a generic way of handling resolution across instrument
space? This gets harder in 2D space and it was already hard for 1D
data. NIST has different data for pinhole compared to e.g. ISIS. But
then there is also the slit smearing for USANS, etc.  This should be
thought of from the viewpoint of a scientist writing a generic
analysis program (SANSview). How does that person deal with resolution
functions from NIST USANS, NIST SANS, ISIS SANS, Diamond/APS SAXS,
etc, in a simplified way?
Perhaps the various ways of tabulating and dealing with resolution
data should be delimited, i.e. if you have a file with pinhole
collimation you expect certain arrays (method1). If you have
slit-smeared data (method2) you have another set of arrays. You extend
the number of methods to cope with all the different instrument designs.
Then we have some document that specifies which methods currently
exist, and how to deal with the resolution from those instruments.
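
As a rough sketch of this "delimited methods" idea, an analysis program
could keep a table of known resolution methods and the arrays each one
requires, and check a file against it. The method names and array names
below (dQ, slit_length, slit_width) are purely illustrative assumptions;
no such list has been agreed.

```python
# Hypothetical registry: resolution method -> arrays the file must supply.
RESOLUTION_METHODS = {
    "pinhole": ("Q", "dQ"),
    "slit_smeared": ("Q", "slit_length", "slit_width"),
}


def check_resolution_arrays(method, arrays):
    """Return the arrays needed for `method`, or raise ValueError.

    `arrays` maps array names to lists of numbers read from a file.
    Arrays the method does not need are simply left out of the result.
    """
    if method not in RESOLUTION_METHODS:
        raise ValueError(f"unknown resolution method: {method!r}")
    required = RESOLUTION_METHODS[method]

    missing = [name for name in required if name not in arrays]
    if missing:
        raise ValueError(f"{method} data is missing arrays: {missing}")

    lengths = {len(arrays[name]) for name in required}
    if len(lengths) > 1:
        raise ValueError(f"{method} arrays have different lengths")

    return {name: arrays[name] for name in required}


file_arrays = {"Q": [0.1, 0.2], "dQ": [0.01, 0.02], "extra": [1, 2, 3]}
pinhole = check_resolution_arrays("pinhole", file_arrays)
```

Supporting a new instrument design would then mean adding one entry to
the registry and documenting how its arrays should be interpreted.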

3) I suspect multiple timeframes with 2D data will have the ability to
produce very large file sizes.  This is the direction neutrons are
heading with list-mode data, but this should already be here for
X-rays. Here you have the dilemma of having a file that is easily
accessible (ASCII), but that is also smallish (compression). One
possible approach is to have a zip file as the overall reduced file,
with the following structure:

reducedFile.sasz (compressed archive, z indicates compression)
|
|-> reducedFile.xml ("normal" xml output, but the Idata node contains
a link to the required Q/I/dQ/dI data)
|
|---->Idata0.txt (contains data for Q, I, dI, dQ)
|---->Idata1.txt
|---->Idata2.txt
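
A minimal sketch of writing and reading such an archive with Python's
standard zipfile module. The file names follow the outline above; the
root element name, the href and columns attributes used as the "link",
and the whitespace-separated column layout are assumptions made for
illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def write_sasz(target, frames):
    """Write frames to a zip archive.

    `frames` is a list of datasets; each dataset is a list of
    (Q, I, dI, dQ) tuples. `target` is a path or a binary file object.
    """
    root = ET.Element("reducedFile")  # hypothetical root element
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for index, rows in enumerate(frames):
            name = f"Idata{index}.txt"
            lines = [" ".join(repr(value) for value in row) for row in rows]
            archive.writestr(name, "\n".join(lines) + "\n")
            # The Idata node only links to the text file holding the numbers.
            ET.SubElement(root, "Idata", {"href": name, "columns": "Q I dI dQ"})
        archive.writestr("reducedFile.xml", ET.tostring(root, encoding="unicode"))


def read_sasz(source):
    """Read back every linked dataset, in the order listed in the XML."""
    frames = []
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("reducedFile.xml"))
        for node in root.findall("Idata"):
            text = archive.read(node.get("href")).decode("utf-8")
            frames.append([tuple(float(v) for v in line.split())
                           for line in text.splitlines() if line.strip()])
    return frames


buffer = io.BytesIO()  # an in-memory stand-in for reducedFile.sasz
original = [[(0.01, 1.0, 0.1, 0.001), (0.02, 0.5, 0.05, 0.002)],
            [(0.01, 0.9, 0.1, 0.001)]]
write_sasz(buffer, original)
restored = read_sasz(buffer)
```

The individual members stay plain text, so a scientist can still unzip
the archive and open Idata0.txt in any editor or plotting tool.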

Each of these points (and the others that will come) requires a great
deal of thought.  I'm not convinced that creating something usable is
possible in the time available.
In fact as I've been writing this email I am starting to think more
and more about how we are going about designing this format. I was
speaking to our resident CIF expert, James Hester, and he is convinced
of a need to go away from hierarchical formats towards a relational
format.

http://en.wikipedia.org/wiki/Relational_database
http://en.wikipedia.org/wiki/Hierarchical_database_model

The trouble with the way that we are designing this format is that
there are always going to be requirements that we have not foreseen.
With a hierarchical format, such as we have at the moment, the design
is relatively inflexible; it has to have a certain structure
determined by e.g. the XML schema. As soon as you are dealing with
polarised, time-dependent, 2D images, with resolution functions ...,
you start tying yourself in knots.

This problem is worse if you go down the NeXus route; those HDF
formats are written in stone with little flexibility.

The relational model says you just stuff information into the file. If
your file reader can deal with it, fine. If it can't it just ignores
it.
James Hester submitted an abstract for SAS2012 on CIF data formats. He
has a huge amount of experience in this field.
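
The "if your reader can't deal with it, it just ignores it" behaviour
can be sketched as follows. This is not CIF syntax; it is a toy model
in which a file is a set of named tables of rows, and the table and
column names are invented for illustration.

```python
# Tables and columns this particular (hypothetical) reader understands.
KNOWN_COLUMNS = {
    "reflectivity": ("frame_id", "Qz", "R", "dR"),
    "frame": ("frame_id", "spin"),
}


def select_known(tables):
    """Keep only the tables and columns the reader knows about.

    `tables` maps a table name to a list of row dictionaries. Unknown
    tables and unknown columns are skipped without raising an error.
    """
    selected = {}
    for table_name, columns in KNOWN_COLUMNS.items():
        rows = tables.get(table_name, [])
        selected[table_name] = [
            {column: row[column] for column in columns if column in row}
            for row in rows
        ]
    return selected


file_contents = {
    "frame": [{"frame_id": 0, "spin": "unpolarised", "magnet_current": 1.5}],
    "reflectivity": [{"frame_id": 0, "Qz": 0.01, "R": 1.0, "dR": 0.1}],
    "he3_cell_log": [{"time": 0.0, "efficiency": 0.95}],  # ignored by this reader
}
usable = select_known(file_contents)
```

Rows in different tables are tied together by a shared key (frame_id
here) rather than by nesting, so a writer can add new tables without
changing the structure that existing readers rely on.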

I have no schema for the XML file format that I currently use. I am
the only one who writes data (and therefore reads) with this format.
The format should be relatively easy to understand; I outlined its
construction in my previous email. The only thing that the offspec
file didn't include was resolution calculations for Qz and Qx. dQz is
easy to calculate; dQx is a little harder.