
Preparing environments for execution

Preparing a DataLad-enabled environment

On CHUV's cluster

If processing on HPC is planned, DataLad will be required on those systems.

  • Start an interactive session on the HPC cluster

    Do not run the installation of Conda and DataLad on the login node

    HPC systems typically recommend using their login nodes only for tasks related to job submission, data management, and preparing jobscripts. Therefore, the execution of resource-intensive tasks such as fMRIPrep or building containers on login nodes can negatively impact the overall performance and responsiveness of the system for all users. Interactive sessions are a great alternative when available and should be used when creating the DataLad dataset. For example, in the case of systems operating SLURM, the following command would open a new interactive session:

    srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
    

  • Install DataLad. Generally, the most convenient and user-sandboxed installation (i.e., without requiring elevated permissions) can be achieved by using Conda, but other alternatives (such as lmod) can be equally valid:

    • Get and install Conda if it is not already deployed in the system:

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh
      
    • Install DataLad:

      conda install -c conda-forge -y "datalad>=1.0" datalad-container
      
    • Alternatively, install DataLad through the system's environment modules (lmod). First, check the availability and dependencies of a specific Python version (here, 3.8.2):

      module spider Python/3.8.2
      
    • Load Python (note that ml is a shorthand for module load):

      ml GCCcore/9.3.0 Python/3.8.2
      
    • Update pip:

      python -m pip install --user -U pip
      
    • Install DataLad:

      python -m pip install --user "datalad>=1.0" datalad-container
      
  • Check that DataLad is properly installed, for instance:

    $ datalad --version
    datalad 1.0.0
    
    DataLad crashes (Conda installations)

    DataLad may fail with the following error:

    ImportError: cannot import name 'getargspec' from 'inspect' (/home/users/cprovins/miniconda3/lib/python3.11/inspect.py)
    

    In such a scenario, create a Conda environment with a lower Python version and re-install DataLad:

    conda create -n "datamgt" python=3.10
    conda activate datamgt
    conda install -c conda-forge datalad datalad-container
    

  • Configure your Git identity settings.

    cd ~
    git config --global --add user.name "Jane Doe"
    git config --global --add user.email doe@example.com
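    To verify that the identity was recorded, you can query the settings back (the name and email shown above are example values):

```shell
# Print the configured identity; the output should match the values you set
git config --global user.name
git config --global user.email
```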
    

On UNIL's Curnagl

Do not run the installation on the login node

As on CHUV's cluster, avoid running installations and other resource-intensive tasks on the login nodes, and use an interactive session instead. On Curnagl, the following command opens one:

salloc --partition=interactive --time=02:00:00 --cpus-per-task 12
  • Install Micromamba following Curnagl's instructions:

    • Add the following two lines to your ~/.bashrc file:

      export PATH="$PATH:/dcsrsoft/spack/external/micromamba"
      export MAMBA_ROOT_PREFIX="/work/FAC/SCHOOL/INSTITUTE/PI/PROJECT/opt/mamba"
      

    • Instruct Micromamba to update your profile by issuing the following command:

      micromamba shell init
      

    • Log out and back in.
  • Create a new environment called datamgt that includes git-annex:

    micromamba create -n datamgt python=3.12 git-annex=*=alldep*
    

  • Activate the environment:
    micromamba activate datamgt
    
  • Install DataLad and DataLad-next:
    python -m pip install datalad datalad-next
    
  • Configure your Git identity settings.

    cd ~
    git config --global --add user.name "Jane Doe"
    git config --global --add user.email doe@example.com
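    As a quick sanity check of the new environment (assuming the datamgt environment created above), confirm that DataLad runs and the datalad-next extension is importable:

```shell
# Both commands should succeed without errors
micromamba run -n datamgt datalad --version
micromamba run -n datamgt python -c "import datalad_next; print('datalad-next OK')"
```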
    

Getting data

Installing the original HCPh dataset with DataLad

Wherever you want to process the data, you'll need to datalad install it before you can pull down (datalad get) the data. To access the metadata (e.g., sidecar JSON files of the BIDS structure), you'll need access to the Git repository that corresponds to the data (https://github.com/<organization>/<repo_name>.git). To fetch the dataset from the RIA store, you will need your SSH key to be added to the authorized keys at Curnagl.
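In short, the procedure boils down to two commands, sketched here with the repository placeholder above (the detailed, cluster-specific steps follow in the next subsections):

```shell
# Lightweight clone: obtains the Git history and file tree, not the annexed contents
datalad install https://github.com/<organization>/<repo_name>.git
cd <repo_name>
# Actually download the file contents from the configured remotes
datalad get .
```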

Getting access to the RIA store

These steps must be done just once before you can access the dataset's data:

  • Create a secure SSH key on the system(s) on which you want to install the dataset.
  • Send the SSH public key you just generated (e.g., ~/.ssh/id_ed25519.pub) over email to Oscar at *@****.
  • Install the dataset:

    micromamba run -n datamgt datalad install https://github.com/<organization>/<repo_name>.git
    
  • Reconfigure the RIA store:

    micromamba run -n datamgt \
        git annex initremote --private --sameas=ria-storage \
        curnagl-storage type=external externaltype=ora encryption=none \
        url="ria+file://<path>"
    

    REQUIRED step

    When on Curnagl, you'll need to convert the ria-storage remote into a local RIA store, because you cannot SSH from Curnagl into itself.

  • Get the dataset:

    Data MUST be fetched from a development node.

    The NAS is not accessible from compute nodes in Curnagl.

    • Open an interactive session on a development node, where datalad get will be executed:

      salloc --partition=interactive --time=02:00:00 --cpus-per-task 12
      

      A successful allocation is indicated by output like:

      salloc: Granted job allocation 47734642
      salloc: Nodes dna064 are ready for job
      Switching to the 20240303 software stack
      
    • Fetch the data:

      cd $WORK/data
      micromamba run -n datamgt datalad get -J${SLURM_CPUS_PER_TASK} .
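    The secure SSH key requested in the first step of this section can be generated, for example, with ssh-keygen (the Ed25519 key type and the comment below are illustrative choices):

```shell
# Generate an Ed25519 key pair; you will be prompted for a passphrase
ssh-keygen -t ed25519 -C "doe@example.com" -f ~/.ssh/id_ed25519
# The public key (the file to share over email) can be displayed with:
cat ~/.ssh/id_ed25519.pub
```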
      

Installing derivatives

Derivatives are installed in a similar way:

  • Install the dataset:

    micromamba run -n datamgt datalad install https://github.com/<organization>/<repo_name>.git
    
  • Reconfigure the RIA store:

    micromamba run -n datamgt \
        git annex initremote --private --sameas=ria-storage \
        curnagl-storage type=external externaltype=ora encryption=none \
        url="ria+file://<path>"
    
  • Fetch the data

    salloc --partition=interactive --time=02:00:00 --cpus-per-task 12
    cd $WORK/data
    micromamba run -n datamgt datalad get -J${SLURM_CPUS_PER_TASK} .
    

Registering containers

We use DataLad's containers-run to execute software while keeping track of provenance. Before first use, containers must be added to the DataLad dataset as follows (example for MRIQC):

  • Register the MRIQC container to the dataset

    datalad containers-add \
        --call-fmt 'singularity exec --cleanenv -B {{${HOME}/tmp/}}:/tmp {img} {cmd}' \
        mriqc \
        --url docker://nipreps/mriqc:23.1.0
    
    Insert relevant arguments to the singularity command line with --call-fmt

    In the example above, we configure the container's call to automatically bind (the -B flag mounts the filesystem) the temporary folder, where MRIQC stores its working directory by default. Please replace the path with the appropriate one for your setting (i.e., laptop, cluster, etc.). When running with Docker instead of Singularity, the equivalent registration is:

    datalad containers-add \
        --call-fmt 'docker run -u $( id -u ) -it -v {{${HOME}/tmp/}}:/tmp {img} {cmd}' \
        mriqc \
        --url docker://nipreps/mriqc:23.1.0
    
    Pinning a particular version of MRIQC

    If a different version of MRIQC should be executed, replace the Docker image's tag (23.1.0) with the desired version tag in the command lines above.
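    Once registered, the container can be invoked through datalad containers-run, which records the exact command and container image in the dataset's history. The input/output locations and MRIQC arguments below are illustrative placeholders rather than part of this protocol:

```shell
# Hypothetical invocation; adjust paths and MRIQC arguments to your dataset layout
datalad containers-run \
    --container-name mriqc \
    --input sourcedata \
    --output derivatives/mriqc \
    "mriqc {inputs} {outputs} participant"
```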