Before data acquisition (storage preparation)

DataLad must be version 1.0 or later

This project maintains data under version control thanks to DataLad. For instructions on how to set up DataLad on your PC, please refer to the official documentation. When employing high-performance computing (HPC), we provide some specific guidelines.

Please read the DataLad Handbook, especially if you are new to this tool
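You can quickly confirm that the installed version satisfies this requirement; the exact version string reported will depend on your installation:

    datalad --version    # should report 1.0.0 or later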

Creating a DataLad dataset

  • Designate a host and folder where data will be centralized. In the context of this study, the primary copy of data will be downloaded into <hostname>, under the path /data/datasets/hcph-pilot-sourcedata for the piloting acquisitions and /data/datasets/hcph-sourcedata for the experimental data collection.
  • Install the bids DataLad procedure provided by this repository to facilitate the correct intake of data and metadata:

    PYTHON_SITE_PACKAGES=$( python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])' )
    ln -s <path>/code/datalad/cfg_bids.py ${PYTHON_SITE_PACKAGES}/datalad/resources/procedures/
    
    DataLad's documentation does not recommend this approach

    For safety, you may prefer to follow DataLad's recommendations and place the cfg_bids.py file in one of the suggested paths.
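    For instance, a minimal sketch of that safer alternative, assuming the default per-user procedure location (~/.config/datalad/procedures/ on Linux; re-run datalad run-procedure --discover afterwards to confirm the procedure is picked up):

    mkdir -p ~/.config/datalad/procedures/
    cp <path>/code/datalad/cfg_bids.py ~/.config/datalad/procedures/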

  • Check that the new procedure is available as bids (it will be listed as cfg_bids):

    $ datalad run-procedure --discover
    cfg_bids (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_bids.py) [python_script]
    cfg_yoda (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_yoda.py) [python_script]
    cfg_metadatatypes (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_metadatatypes.py) [python_script]
    cfg_text2git (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_text2git.py) [python_script]
    cfg_noannex (/home/oesteban/.miniconda/lib/python3.9/site-packages/datalad/resources/procedures/cfg_noannex.py) [python_script]
    

    Learn more about the YODA principles (DataLad Handbook)

  • Create a DataLad dataset for the original dataset:

    cd /data/datasets/
    datalad create -c bids hcph-dataset
    
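    As a quick sanity check, you can inspect what the bids procedure set up in the freshly created dataset; what exactly it configures depends on the contents of cfg_bids.py, so treat the commands below as a sketch:

    git -C hcph-dataset log --oneline    # initial commits, including the procedure's changes
    cat hcph-dataset/.gitattributes      # annex rules configured at creation time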
  • Configure a RIA store, where large files will be pushed (and pulled from when installing the dataset on other computers):

    Creating a RIA sibling to store large files
    cd hcph-dataset
    datalad create-sibling-ria -s ria-storage --alias hcph-dataset \
            --new-store-ok --storage-sibling=only \
            "ria+ssh://<username>@curnagl.dcsr.unil.ch:<absolute-path-of-store>"
    

    Getting [ERROR ] 'SSHRemoteIO' ...

    If you encounter:

    [ERROR ] 'SSHRemoteIO' object has no attribute 'url2transport_path'
    

    Type in the following Git configuration (datalad/datalad-next#754):

    git config --global --add datalad.extensions.load next
    
  • Configure a GitHub sibling, to host the Git history and the annex metadata:

    Creating a GitHub sibling to store DataLad's infrastructure and dataset's metadata
    datalad siblings add --dataset . --name github \
            --pushurl git@github.com:<organization>/<repo_name>.git \
            --url https://github.com/<organization>/<repo_name>.git \
            --publish-depends ria-storage
    
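    Before moving on, you may want to confirm that both siblings were registered; the flags reported next to each name will vary with your setup:

    datalad siblings    # should list at least: here, ria-storage, and github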

Synchronizing your DataLad dataset

Once the dataset is installed, new sessions will be added as data collection goes on. When a new session is added on the central host, your local DataLad dataset remains at its previous point in history (meaning, it becomes out-of-date).

  • Pull the new changes into the Git history. DataLad will first fetch the Git remotes and then merge for you.

    cd hcph-dataset/  # <--- cd into the dataset's path
    datalad update -r --how ff-only
    
  • If you need the data, you can now get it as usual:

    find sub-001/ses-pilot019 -name "*.nii.gz" | xargs datalad get -J 8
    
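    If local disk space is a concern, retrieved annexed files can later be dropped again; DataLad verifies that the content remains available elsewhere (e.g., on the RIA store) before removing the local copy:

    datalad drop sub-001/ses-pilot019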

Adding data or metadata

  • Use datalad save indicating the paths you want to add, and include --to-git if the file contains only metadata (e.g., JSON files).

    find sub-001/ses-pilot019 -name "*.nii" -or -name "*.nii.gz" -or -name "*.tsv.gz" | \
        xargs datalad save -m '"add(pilot019): new session data (NIfTI and compressed TSV)"'
    
    find sub-001/ses-pilot019 -name "*.json" -or -name "*.tsv" -or -name "*.bvec" -or -name "*.bval" | \
        xargs datalad save --to-git -m '"add(pilot019): new session metadata (JSON, TSV, bvec/bval)"'
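    After saving, the new commits and annexed contents still live only on the data-management host. Because the github sibling was configured with --publish-depends ria-storage, a single push publishes both the Git history and the annexed data:

    datalad push --to github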
    

Preparing derivative subdatasets

DataLad's dataset modularity can be leveraged for derivatives. Instead of creating a single, monolithic dataset for all derivatives, we will create one dataset per derivative set and then aggregate them as subdatasets. The steps to create derivative subdatasets are similar to those for the original dataset. To make this documentation generalize across different derivative sets, we first set a variable with the derivative set's name; here, the steps are demonstrated for the outputs of sMRIPrep.

  • Initialize the derivative dataset:

    export DERIVS_REPO=smriprep
    datalad create -c bids hcph-${DERIVS_REPO}
    cd hcph-${DERIVS_REPO}
    

  • Create the RIA store:

    Creating a RIA sibling to store large files
    datalad create-sibling-ria -s ria-storage --alias hcph-${DERIVS_REPO} \
            --new-store-ok --storage-sibling=only \
            "ria+ssh://<username>@curnagl.dcsr.unil.ch:<absolute-path-of-store>"
    

    Getting [ERROR ] 'SSHRemoteIO' ...

    If you encounter:

    [ERROR ] 'SSHRemoteIO' object has no attribute 'url2transport_path'
    

    Type in the following Git configuration (datalad/datalad-next#754):

    git config --global --add datalad.extensions.load next
    
  • Configure the GitHub sibling:

    datalad create-sibling-github --dataset . -s github --existing error \
            --access-protocol https-ssh --private \
            --description "HCPh Derivatives: ${DERIVS_REPO}" \
            --publish-depends ria-storage \
            <organization>/<repo_name>
    

    If successful, the output should be something like:

    create_sibling_github(ok): [sibling repository 'github' created at https://github.com/<organization>/<repo_name>]
    configure-sibling(ok): . (sibling)
    action summary:
      configure-sibling (ok: 1)
      create_sibling_github (ok: 1)
    

    The repository should now exist on GitHub. If it already existed (and create-sibling-github therefore failed because of --existing error), you can register the sibling manually instead:

    datalad siblings add --dataset . --name github \
            --pushurl git@github.com:<organization>/<repo_name>.git \
            --url https://github.com/<organization>/<repo_name>.git \
            --publish-depends ria-storage
    
  • Push the configuration to the GitHub sibling:

    datalad push --to=github
    

    The output should be something like:

    publish(ok): . (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
    publish(ok): . (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
    
    action summary:
       publish (ok: 2)
    
  • Check that master is the default branch in the repository settings (in this case: https://github.com/<organization>/<repo_name>/settings)
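    As an alternative to the web interface, the remote's default (HEAD) branch can also be queried from the command line:

    git remote show github | grep "HEAD branch"    # should print: HEAD branch: master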

Nesting derivatives datasets

With the structure of derivatives we have defined above, we will then nest DataLad datasets to obtain consistent, global derivatives. For example, let's nest the FreeSurfer derivatives into the sourcedata/freesurfer folder of the fMRIPrep derivatives of the piloting sessions with the reliability protocol.

  • Change directory into the superdataset folder, ensure it is up-to-date, and create a new branch (as a precaution):

    export TOP_REPO=fmriprep-reliability-pilot
    export DERIVS_REPO=freesurfer-reliability-pilot
    
    cd hcph-${TOP_REPO}
    datalad update --how ff-only
    git checkout -b add/freesurfer-subdataset master
    

  • Install the subdataset (in this case into sourcedata/freesurfer)

    datalad install -d . \
                    -s https://github.com/<organization>/<repo_name>.git \
                    sourcedata/freesurfer
    
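    You can confirm that the subdataset was registered in the superdataset; the reported commit hash will reflect the state you installed:

    datalad subdatasets    # should list sourcedata/freesurfer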

  • Push the changes to the super-dataset:

    datalad push --to=github
    

  • Visit the superdataset's GitHub repository:

    • Create a pull-request (PR)
    • Review the PR
    • Merge the PR
  • Back on the data-management host, update the dataset:

    cd hcph-${TOP_REPO}  # if necessary
    git checkout master
    datalad update --how ff-only
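    On other clones of the superdataset, once they have been updated, the newly nested subdataset can be installed without retrieving its (potentially large) contents by passing -n/--no-data:

    datalad get -n sourcedata/freesurfer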