# Before data acquisition (storage preparation)
!!! warning "DataLad must be version 1.0 or later"
This project maintains data under version control with DataLad. For instructions on how to set up DataLad on your PC, please refer to the official documentation. When employing high-performance computing (HPC), we provide some specific guidelines.
!!! important "Please read the DataLad Handbook, especially if you are new to this tool"
## Creating a DataLad dataset
1. Designate a host and folder where data will be centralized.
    In the context of this study, the primary copy of the data will be downloaded into `<hostname>`, under the path `/data/datasets/hcph-pilot-sourcedata` for the piloting acquisitions and `/data/datasets/hcph-sourcedata` for the experimental data collection.
2. Install the `bids` DataLad procedure provided by this repository to facilitate the correct intake of data and metadata:

    ```shell
    PYTHON_SITE_PACKAGES=$( python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])' )
    ln -s <path>/code/datalad/cfg_bids.py ${PYTHON_SITE_PACKAGES}/datalad/resources/procedures/
    ```
    !!! warning "DataLad's documentation does not recommend this approach"

        For safety, you may prefer to follow DataLad's recommendations and place the `cfg_bids.py` file in one of the suggested paths.
3. Check that the new procedure is available as `bids`:

    ```shell
    $ datalad run-procedure --discover
    cfg_bids (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_bids.py) [python_script]
    cfg_yoda (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_yoda.py) [python_script]
    cfg_metadatatypes (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_metadatatypes.py) [python_script]
    cfg_text2git (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_text2git.py) [python_script]
    cfg_noannex (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_noannex.py) [python_script]
    ```
    !!! note "Learn more about the YODA principles (DataLad Handbook)"
4. Create a DataLad dataset for the original dataset:
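    A minimal sketch of this step, assuming the `bids` procedure installed above and an illustrative dataset name (`hcph-dataset` is an assumption, not prescribed by this guide):

    ```shell
    # Create a new DataLad dataset and apply the cfg_bids procedure
    # installed above (-c bids runs cfg_bids at creation time).
    # The dataset name "hcph-dataset" is illustrative.
    datalad create -c bids hcph-dataset
    ```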
5. Configure a RIA store, where large files will be pushed (and pulled from when installing the dataset on other computers):

    ```shell title="Creating a RIA sibling to store large files"
    cd hcph-dataset
    datalad create-sibling-ria -s ria-storage --alias hcph-dataset \
        --new-store-ok --storage-sibling=only \
        "ria+ssh://<username>@curnagl.dcsr.unil.ch:<absolute-path-of-store>"
    ```
    !!! warning "Getting `[ERROR ] 'SSHRemoteIO' ...`?"

        If you encounter this error, type in the following Git configuration (datalad/datalad-next#754):
6. Configure a GitHub sibling to host the Git history and the annex metadata:
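    One way to sketch this step with DataLad's `create-sibling-github` command; the repository name, sibling name, and the publish dependency on the RIA store are assumptions (a publish dependency makes `datalad push` transfer annexed content to the RIA store first):

    ```shell
    # Create a GitHub sibling named "github" for this dataset;
    # pushing to it will first push annexed data to the RIA store.
    # Repository and sibling names are illustrative.
    datalad create-sibling-github -d . hcph-dataset \
        -s github --publish-depends ria-storage
    ```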
## Synchronizing your DataLad dataset
Once the dataset is installed, new sessions will be added as data collection goes on. When a new session is added, your local DataLad dataset will remain at its previous point in history (meaning, it will become out of date).
1. Pull new changes in the Git history. DataLad will first fetch the Git remotes and merge them for you:
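    A sketch of this step with DataLad's `update` command, assuming the default configured sibling:

    ```shell
    # Fetch changes from the tracked sibling and merge them
    # into the local dataset's history
    datalad update --how merge
    ```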
2. If you need the data, you can now get it as usual:
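    For instance (the session path below is hypothetical):

    ```shell
    # Retrieve the annexed file contents for one session
    datalad get sub-001/ses-001
    ```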
## Adding data or metadata
- Use `datalad save` indicating the paths you want to add, and include `--to-git` if the file contains only metadata (e.g., JSON files).
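    A sketch of both cases (the paths and commit messages below are hypothetical):

    ```shell
    # Annexed data files: a regular save
    datalad save -m "Add session ses-001" sub-001/ses-001

    # Metadata-only files (e.g., JSON sidecars) go directly into Git
    datalad save -m "Add sidecar metadata" --to-git sub-001/ses-001/anat/*.json
    ```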