Databases Notes
Notes from canSAS 2024 Databases Topical Presentation/Discussion
FAIR data
- FAIR checkers are available, for example www.f-uji.net
Public databases:
- Catalog of FAIR databases: a search on fairsharing.org brings up two databases, SASBDB and simplescattering.com.
- Zenodo also contains 14k datasets, as well as software.
Local databases:
- SciCat: makes it possible to catalog datasets and instruments.
- Tiled: a data access framework linked to Bluesky; a connection with SciCat is planned.
What is needed
- A publicly accessible (SciCat) catalog would be useful for labs to catalog public scattering datasets in.
- This could be taken up as part of the International Scattering Alliance, as it certainly falls within its remit. It needs a document specifying what is needed, what should be uploaded to it, and how much it would cost.
- Data to include in this would be (corrected, well-documented) data on easily accessible samples. This could include single phases (backgrounds) like water, hexane, toluene, acetone, PEG-DA, scotch tape, etc., and also reference samples such as the silver nanoparticle solutions. The datasets and samples should be well-described and ideally complete with a data processing graph.
- There should be a provision to add annotations (text, keywords, flags) to datasets. These could say a lot about the human interpretation of a given dataset.
- This (and other well-documented databases) could have benefits for the third “user” group: the group of people that need your data to train ML systems.
- All databases need a good user interface that actually addresses the needs of the user. Once you build that, and users see the benefit they get from it, they will come. For example, MX people cannot work without their dashboards. FAIR-compliant user interfaces need to become so easy to use that it is prohibitively hard for the user not to follow the recommended pathway.
- We might need guidelines on how to describe a sample sufficiently well, so that the description can be used in the analysis. The sample description and the (corrected, fully described) data should be made available in a public repository at the latest at the time of publication.
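As an illustration of the annotation and sample-description points above, a catalog entry could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the class names, the list of required sample fields, and the example values are hypothetical, and it does not reflect the SciCat schema or any agreed canSAS guideline.

```python
from dataclasses import dataclass, field

# Hypothetical list of required sample-description fields.
# The notes only say that guidelines "might be needed"; no list was agreed.
REQUIRED_SAMPLE_FIELDS = ("name", "composition", "container", "temperature_K")


@dataclass
class Annotation:
    """A human remark attached to a dataset: free text, keywords, and flags."""
    text: str
    keywords: list[str] = field(default_factory=list)
    flags: list[str] = field(default_factory=list)


@dataclass
class DatasetRecord:
    """Illustrative catalog entry (not the SciCat data model)."""
    title: str
    sample: dict[str, object]
    corrected: bool = False
    annotations: list[Annotation] = field(default_factory=list)

    def missing_sample_fields(self) -> list[str]:
        """Return the required sample fields that are absent or empty."""
        return [
            key for key in REQUIRED_SAMPLE_FIELDS
            if self.sample.get(key) in (None, "")
        ]

    def ready_for_upload(self) -> bool:
        """Data must be corrected and the sample fully described."""
        return self.corrected and not self.missing_sample_fields()


# Example: a water background measurement with an incomplete description.
record = DatasetRecord(
    title="Water background, 1 mm capillary",
    sample={"name": "water", "composition": "H2O", "container": "1 mm capillary"},
    corrected=True,
)
record.annotations.append(
    Annotation(
        text="Weak parasitic scattering at low q",
        keywords=["background"],
        flags=["check-beamstop"],
    )
)

print(record.missing_sample_fields())  # ['temperature_K']
print(record.ready_for_upload())       # False
```

A check of this kind, run at upload time, is one way a user interface could make the recommended pathway the easiest one: the entry is refused until the sample description is complete.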