
Preparing environments for execution

Preparing a DataLad-enabled environment

On CHUV's cluster

If processing on HPC is planned, DataLad will be required on those systems.

  • Start an interactive session on the HPC cluster

    Do not run the installation of Conda and DataLad on the login node

    HPC systems typically recommend using their login nodes only for tasks related to job submission, data management, and preparing jobscripts. Therefore, the execution of resource-intensive tasks such as fMRIPrep or building containers on login nodes can negatively impact the overall performance and responsiveness of the system for all users. Interactive sessions are a great alternative when available and should be used when creating the DataLad dataset. For example, in the case of systems operating SLURM, the following command would open a new interactive session:

    srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
    

  • Install DataLad. Generally, the most convenient and user-sandboxed installation (i.e., without requiring elevated permissions) can be achieved by using Conda, but other alternatives (such as lmod) can be equally valid:

    • Get and install Conda if it is not already deployed in the system:

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh
      
    • Install DataLad:

      conda install -c conda-forge -y "datalad>=1.0" datalad-container
      
    • Alternatively, install DataLad through the system's environment modules (lmod). First, check the availability and dependencies of a specific Python version (here, 3.8.2):

      module spider Python/3.8.2
      
    • Load Python (note that ml is a shorthand for module load):

      ml GCCcore/9.3.0 Python/3.8.2
      
    • Update pip:

      python -m pip install --user -U pip
      
    • Install DataLad:

      python -m pip install --user "datalad>=1.0" datalad-container
      
  • Check that DataLad is properly installed, for instance:

    $ datalad --version
    datalad 1.0.0
    
    DataLad crashes (Conda installations)

    DataLad may fail with the following error:

    ImportError: cannot import name 'getargspec' from 'inspect' (/home/users/cprovins/miniconda3/lib/python3.11/inspect.py)
    

    In such a scenario, create a Conda environment with a lower Python version and re-install DataLad:

    conda create -n "datamgt" python=3.10
    conda activate datamgt
    conda install -c conda-forge datalad datalad-container
    

  • Configure your Git identity settings.

    cd ~
    git config --global --add user.name "Jane Doe"
    git config --global --add user.email doe@example.com
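    To verify that the identity was recorded, you can query the settings back (the name and email shown above are example values):

```shell
# Print the configured identity; the output should match the values you set
git config --global user.name
git config --global user.email
```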
    

On UNIL's Curnagl

Do not run the installation on the login node

As on CHUV's cluster, avoid running installations and other resource-intensive tasks on the login nodes, and use an interactive session instead. On Curnagl, the following command opens one:

salloc --partition=interactive --time=02:00:00 --cpus-per-task 12
  • Install Micromamba following Curnagl's instructions:

    • Add the following two lines to your ~/.bashrc file:

      export PATH="$PATH:/dcsrsoft/spack/external/micromamba"
      export MAMBA_ROOT_PREFIX="/work/FAC/SCHOOL/INSTITUTE/PI/PROJECT/opt/mamba"
      

    • Instruct Micromamba to update your profile by issuing the following command:

      micromamba shell init
      

    • Log out and back in.
  • Create a new environment called datamgt that includes git-annex:

    micromamba create -n datamgt python=3.12 git-annex=*=alldep*
    

  • Activate the environment:
    micromamba activate datamgt
    
  • Install DataLad and DataLad-next:
    python -m pip install datalad datalad-next
    
  • Configure your Git identity settings.

    cd ~
    git config --global --add user.name "Jane Doe"
    git config --global --add user.email doe@example.com
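    As a quick sanity check of the new environment (assuming the datamgt environment created above), confirm that DataLad runs and the datalad-next extension is importable:

```shell
# Both commands should succeed without errors
micromamba run -n datamgt datalad --version
micromamba run -n datamgt python -c "import datalad_next; print('datalad-next OK')"
```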
    

Getting data

Installing the original HCPh dataset with DataLad

Wherever you want to process the data, you'll need to datalad install it before you can pull down (datalad get) the data. To access the metadata (e.g., sidecar JSON files of the BIDS structure), you'll need access to the Git repository that corresponds to the data (https://github.com/<organization>/<repo_name>.git). To fetch the dataset from the RIA store, you will need your SSH key to be added to the authorized keys at Curnagl.
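In short, the procedure boils down to two commands, sketched here with the repository placeholder above (the detailed, cluster-specific steps follow in the next subsections):

```shell
# Lightweight clone: obtains the Git history and file tree, not the annexed contents
datalad install https://github.com/<organization>/<repo_name>.git
cd <repo_name>
# Actually download the file contents from the configured remotes
datalad get .
```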

Getting access to the RIA store

These steps must be done just once before you can access the dataset's data:

  • Create a secure SSH key on the system(s) on which you want to install the dataset.
  • Send the SSH public key you just generated (e.g., ~/.ssh/id_ed25519.pub) over email to Oscar at *@****.
  • Install the dataset:

    micromamba run -n datamgt datalad install https://github.com/<organization>/<repo_name>.git
    
  • Reconfigure the RIA store:

    micromamba run -n datamgt \
        git annex initremote --private --sameas=ria-storage \
        curnagl-storage type=external externaltype=ora encryption=none \
        url="ria+file://<path>"
    

    REQUIRED step

    When on Curnagl, you'll need to convert the ria-storage remote into a local RIA store, because you cannot SSH from Curnagl into itself.

  • Get the dataset:

    Data MUST be fetched from a development node.

    The NAS is not accessible from compute nodes in Curnagl.

    • Open an interactive session on a development node, where datalad get will be executed:

      salloc --partition=interactive --time=02:00:00 --cpus-per-task 12
      

      A successful allocation is indicated by output like:

      salloc: Granted job allocation 47734642
      salloc: Nodes dna064 are ready for job
      Switching to the 20240303 software stack
      
    • Fetch the data:

      cd $WORK/data
      micromamba run -n datamgt datalad get -J${SLURM_CPUS_PER_TASK} .
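    The secure SSH key requested in the first step of this section can be generated, for example, with ssh-keygen (the Ed25519 key type and the comment below are illustrative choices):

```shell
# Generate an Ed25519 key pair; you will be prompted for a passphrase
ssh-keygen -t ed25519 -C "doe@example.com" -f ~/.ssh/id_ed25519
# The public key (the file to share over email) can be displayed with:
cat ~/.ssh/id_ed25519.pub
```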
      

Installing derivatives

Derivatives are installed in a similar way:

  • Install the dataset:

    micromamba run -n datamgt datalad install https://github.com/<organization>/<repo_name>.git
    
  • Reconfigure the RIA store:

    micromamba run -n datamgt \
        git annex initremote --private --sameas=ria-storage \
        curnagl-storage type=external externaltype=ora encryption=none \
        url="ria+file://<path>"
    
  • Fetch the data

    salloc --partition=interactive --time=02:00:00 --cpus-per-task 12
    cd $WORK/data
    micromamba run -n datamgt datalad get -J${SLURM_CPUS_PER_TASK} .
    

Registering containers

We use DataLad's containers-run to execute software while keeping track of provenance. Before first use, containers must be added to the DataLad dataset as follows (example for MRIQC):

  • Register the MRIQC container to the dataset

    datalad containers-add \
        --call-fmt 'singularity exec --cleanenv -B {{${HOME}/tmp/}}:/tmp {img} {cmd}' \
        mriqc \
        --url docker://nipreps/mriqc:23.1.0
    
    Insert relevant arguments to the singularity command line with --call-fmt

    In the example above, we configure the container's call to automatically bind (the -B flag mounts the filesystem) the temporary folder, where MRIQC stores its working directory by default. Please replace the path with the appropriate one for your setting (i.e., laptop, cluster, etc.). When running with Docker instead of Singularity, the equivalent registration is:

    datalad containers-add \
        --call-fmt 'docker run -u $( id -u ) -it -v {{${HOME}/tmp/}}:/tmp {img} {cmd}' \
        mriqc \
        --url docker://nipreps/mriqc:23.1.0
    
    Pinning a particular version of MRIQC

    If a different version of MRIQC should be executed, replace the Docker image's tag (23.1.0) with the desired version tag in the command lines above.
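    Once registered, the container can be invoked through datalad containers-run, which records the exact command and container image in the dataset's history. The input/output locations and MRIQC arguments below are illustrative placeholders rather than part of this protocol:

```shell
# Hypothetical invocation; adjust paths and MRIQC arguments to your dataset layout
datalad containers-run \
    --container-name mriqc \
    --input sourcedata \
    --output derivatives/mriqc \
    "mriqc {inputs} {outputs} participant"
```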