SLURM Integration
=================

NL-BIOMERO runs scalable bioimage analysis workflows on High-Performance Computing (HPC) clusters managed by SLURM.

Overview
--------

The SLURM integration allows you to:

* Execute computationally intensive workflows on HPC clusters
* Scale analysis across multiple compute nodes
* Leverage specialized hardware (GPUs, high-memory nodes)
* Manage workflow queuing and resource allocation

Architecture
------------

.. code-block:: text

   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
   │   OMERO Web     │    │  BIOMERO Worker │    │  SLURM Cluster  │
   │                 │    │                 │    │                 │
   │ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
   │ │ User submits│ │───▶│ │ Workflow    │ │───▶│ │ Job Queue   │ │
   │ │ workflow    │ │    │ │ Manager     │ │    │ │             │ │
   │ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
   │                 │    │        │        │    │        │        │
   │ ┌─────────────┐ │    │        ▼        │    │        ▼        │
   │ │ Results     │ │◀───│ ┌─────────────┐ │    │ ┌─────────────┐ │
   │ │ Display     │ │    │ │ Progress    │ │◀───│ │ Compute     │ │
   │ └─────────────┘ │    │ │ Tracking    │ │    │ │ Nodes       │ │
   └─────────────────┘    │ └─────────────┘ │    │ └─────────────┘ │
                          └─────────────────┘    └─────────────────┘

Configuration
-------------

SLURM integration is configured through ``slurm-config.ini`` files located in:

* **Web interface**: ``/NL-BIOMERO/web/slurm-config.ini``
* **Worker service**: ``/NL-BIOMERO/biomeroworker/slurm-config.ini``

Basic Configuration
~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [SSH]
   # SLURM cluster connection
   host=localslurm

   [SLURM]
   # Storage paths on SLURM cluster
   slurm_data_path=/data/my-scratch/data
   slurm_images_path=/data/my-scratch/singularity_images/workflows
   slurm_converters_path=/data/my-scratch/singularity_images/converters
   slurm_script_path=/data/my-scratch/slurm-scripts
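
Before starting the containers, it can help to confirm that a ``slurm-config.ini`` file defines the keys shown above. The following is a minimal sketch, not part of NL-BIOMERO: ``ini_get`` and ``check_slurm_config`` are hypothetical helper names, and treating exactly these five keys as mandatory is an assumption based on the example above.

.. code-block:: bash

   # Hypothetical helpers (not shipped with NL-BIOMERO).

   # Usage: ini_get FILE SECTION KEY
   # Prints the value of KEY inside [SECTION]; prints nothing if it is absent.
   ini_get() {
       awk -v section="[$2]" -v key="$3" '
           { line = $0; gsub(/^[ \t]+|[ \t\r]+$/, "", line) }      # trim whitespace
           line ~ /^\[/ { in_section = (line == section); next }   # section header
           in_section && line !~ /^[#;]/ {                         # skip comments
               eq = index(line, "=")
               if (eq == 0) next
               k = substr(line, 1, eq - 1); v = substr(line, eq + 1)
               gsub(/[ \t]+$/, "", k); gsub(/^[ \t]+/, "", v)
               if (k == key) { print v; exit }
           }
       ' "$1"
   }

   # Report every key from the basic configuration that is missing or empty.
   check_slurm_config() {
       local file=$1 entry status=0
       for entry in SSH:host SLURM:slurm_data_path SLURM:slurm_images_path \
                    SLURM:slurm_converters_path SLURM:slurm_script_path; do
           if [ -z "$(ini_get "$file" "${entry%%:*}" "${entry#*:}")" ]; then
               echo "missing [${entry%%:*}] ${entry#*:} in $file" >&2
               status=1
           fi
       done
       return "$status"
   }

For example, ``check_slurm_config /NL-BIOMERO/web/slurm-config.ini`` returns a non-zero status and names each missing key on standard error.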

Container Environment Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some clusters require container bind paths to be set explicitly. In that case, add ``slurm_data_bind_path``:

.. code-block:: ini

   [SLURM]
   # Required when containers need explicit path binding
   # Sets APPTAINER_BINDPATH environment variable
   slurm_data_bind_path=/data/my-scratch/data
   
   # Optional: specify partition for conversion jobs
   slurm_conversion_partition=cpu-short

.. note::
   Configure ``slurm_data_bind_path`` only when your HPC administrator requires setting the ``APPTAINER_BINDPATH`` environment variable.

Workflow Definitions
~~~~~~~~~~~~~~~~~~~~

Available workflows are defined in the ``[MODELS]`` section:

.. code-block:: ini

   [MODELS]
   # Cellpose segmentation workflow
   cellpose=cellpose
   cellpose_repo=https://github.com/TorecLuik/W_NucleiSegmentation-Cellpose/tree/v1.4.0
   cellpose_job=jobs/cellpose.sh
   cellpose_job_mem=4GB

   # StarDist segmentation workflow  
   stardist=stardist
   stardist_repo=https://github.com/Neubias-WG5/W_NucleiSegmentation-Stardist/tree/v1.3.2
   stardist_job=jobs/stardist.sh
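
Each ``*_repo`` value above ends in ``/tree/<ref>``, which pins the workflow to a specific version. A quick way to catch a mistyped version before deployment is to split the URL and ask the remote whether the ref exists. This is only a sketch: ``split_repo_url`` and ``repo_tag_exists`` are hypothetical helper names, the ref is assumed to be a Git tag, and the second helper needs ``git`` and network access.

.. code-block:: bash

   # Hypothetical helpers (not shipped with NL-BIOMERO).

   # Print "<repository URL> <ref>" for a URL of the form <repository>/tree/<ref>.
   split_repo_url() {
       local url=$1
       echo "${url%%/tree/*} ${url##*/tree/}"
   }

   # Succeed only if <ref> exists as a tag on the remote repository.
   # git ls-remote --exit-code returns status 2 when no matching ref is found.
   repo_tag_exists() {
       local url=$1
       git ls-remote --exit-code --tags \
           "${url%%/tree/*}" "refs/tags/${url##*/tree/}" >/dev/null
   }

Calling ``split_repo_url`` with the Cellpose URL above prints the repository URL followed by ``v1.4.0``.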

Analytics and Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~

Enable workflow tracking and analytics:

.. code-block:: ini

   [ANALYTICS]
   # Enable workflow tracking
   track_workflows=True
   
   # Enable specific monitoring features
   enable_job_accounting=True
   enable_job_progress=True
   enable_workflow_analytics=True

Deployment Considerations
-------------------------

SSH Configuration
~~~~~~~~~~~~~~~~~

Configure SSH access to your SLURM cluster:

.. code-block:: text

   # In ~/.ssh/config
   Host localslurm
       HostName your-slurm-cluster.example.com
       User your-username
       IdentityFile ~/.ssh/id_rsa_slurm
       Port 22

Directory Structure
~~~~~~~~~~~~~~~~~~~

Ensure required directories exist on the SLURM cluster:

.. code-block:: bash

   # Create directory structure
   mkdir -p /data/my-scratch/{data,singularity_images/{workflows,converters},slurm-scripts}
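
After creating the directories, a short check run on the cluster can confirm that each one exists and is writable by the account BIOMERO connects as. The helper below is a sketch: ``check_biomero_dirs`` is a hypothetical name, and the directory list simply mirrors the ``mkdir`` command above.

.. code-block:: bash

   # Hypothetical helper (not shipped with NL-BIOMERO).

   # Usage: check_biomero_dirs BASE_PATH
   # Prints one status line per expected directory; returns 1 if any check fails.
   check_biomero_dirs() {
       local base=$1 dir status=0
       for dir in data singularity_images/workflows \
                  singularity_images/converters slurm-scripts; do
           if [ -d "$base/$dir" ] && [ -w "$base/$dir" ]; then
               echo "ok       $base/$dir"
           else
               echo "PROBLEM  $base/$dir (missing or not writable)"
               status=1
           fi
       done
       return "$status"
   }

With the paths used on this page, the call would be ``check_biomero_dirs /data/my-scratch``.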

Permissions and Access
~~~~~~~~~~~~~~~~~~~~~~

* Verify the BIOMERO worker can SSH to the SLURM cluster
* Ensure read/write access to configured directories
* Check SLURM account permissions and quotas
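
The first check can be scripted so that it fails fast instead of waiting at a password prompt. The sketch below assumes it runs inside the worker container and that ``localslurm`` is the SSH host alias from the configuration above; ``check_worker_ssh`` is a hypothetical helper name.

.. code-block:: bash

   # Hypothetical helper (not shipped with NL-BIOMERO).

   # Succeed only if key-based SSH login works and the SLURM client tools respond.
   check_worker_ssh() {
       local host=${1:-localslurm}
       # BatchMode=yes disables password prompts, so a missing key fails immediately.
       if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
           echo "SSH login to $host failed" >&2
           return 1
       fi
       # sinfo lists the partitions visible to this account.
       if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" sinfo --summarize; then
           echo "sinfo failed on $host" >&2
           return 2
       fi
   }

A return status of 1 points at SSH keys or network access, while 2 suggests the login works but the SLURM commands are unavailable to that account.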

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Container Access Errors**

If workflows fail with file access errors:

1. Configure explicit path binding:

   .. code-block:: ini

      [SLURM]
      slurm_data_bind_path=/data/my-scratch/data

2. Verify directory permissions on the SLURM cluster
3. Check if Singularity/Apptainer can access the data directory

**SSH Connection Failures**

If the worker cannot connect to SLURM:

1. Test SSH connection manually from the worker container
2. Verify SSH key authentication
3. Check network connectivity and firewall rules

**Job Submission Issues**

If jobs fail to submit:

1. Verify SLURM account and partition access
2. Check resource request limits (memory, GPU, etc.)
3. Review SLURM queue policies and restrictions

**Workflow Execution Failures**

If submitted jobs fail during execution:

1. Check SLURM job logs for errors
2. Verify container images are accessible
3. Ensure input data is properly transferred

Debug Commands
~~~~~~~~~~~~~~

.. code-block:: bash

   # Test SSH connection
   docker exec -it biomeroworker ssh localslurm
   
   # Check SLURM status (single quotes so $USER expands on the cluster, not on the Docker host)
   docker exec -it biomeroworker ssh localslurm 'squeue -u $USER'
   
   # View job details
   docker exec -it biomeroworker ssh localslurm "scontrol show job JOBID"
   
   # Check directory permissions
   docker exec -it biomeroworker ssh localslurm "ls -la /data/my-scratch/"
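
``scontrol show job`` only reports jobs that are pending, running, or recently finished. For older jobs, the accounting database can usually be queried with ``sacct``, provided job accounting is enabled on the cluster. A minimal sketch, where ``job_summary`` is a hypothetical wrapper and the field list is just one reasonable choice:

.. code-block:: bash

   # Hypothetical wrapper (not shipped with NL-BIOMERO).

   # Usage: job_summary JOBID [SSH_HOST]
   # Prints state, exit code, runtime and peak memory for one SLURM job.
   job_summary() {
       local job_id=$1 host=${2:-localslurm}
       docker exec biomeroworker ssh "$host" \
           "sacct -j $job_id --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS"
   }

A ``State`` of ``OUT_OF_MEMORY`` or ``TIMEOUT`` usually means the workflow's memory or time request should be raised.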

Performance Tuning
------------------

Resource Allocation
~~~~~~~~~~~~~~~~~~~

Set per-workflow resource requests with ``<workflow>_job_<option>`` keys in the ``[MODELS]`` section:

.. code-block:: ini

   [MODELS]
   # CPU-intensive workflow
   cellprofiler_job_mem=32GB
   cellprofiler_job_time=02:00:00
   
   # GPU workflow
   cellpose_job_gres=gpu:1g.10gb:1
   cellpose_job_partition=gpu-partition
   
   # Memory-intensive workflow
   stardist_job_mem=64GB
   stardist_job_partition=himem
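
The ``<workflow>_job_<option>`` keys above read like ``sbatch`` options, and the sketch below makes that correspondence explicit. It assumes each key is passed to ``sbatch`` as ``--<option>=<value>``; that mapping is an assumption here, not something this page confirms, and ``job_key_to_sbatch_flag`` is a hypothetical helper name.

.. code-block:: bash

   # Hypothetical helper (not shipped with NL-BIOMERO).

   # Turn "<workflow>_job_<option>=<value>" into "--<option>=<value>".
   job_key_to_sbatch_flag() {
       local line key value option
       line=$1
       key=${line%%=*}            # text before the first "="
       value=${line#*=}           # text after the first "="
       option=${key#*_job_}       # drop the "<workflow>_job_" prefix
       printf '%s\n' "--$option=$value"
   }

   job_key_to_sbatch_flag 'cellpose_job_gres=gpu:1g.10gb:1'
   job_key_to_sbatch_flag 'stardist_job_mem=64GB'

Under that assumption, a request can be dry-run on the cluster by passing the resulting flags to ``sbatch --test-only``, which validates the request without submitting a job.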

Queue Management
~~~~~~~~~~~~~~~~

* Use appropriate partitions for different workflow types
* Configure job time limits based on expected runtime
* Consider using job arrays for batch processing

Monitoring and Analytics
~~~~~~~~~~~~~~~~~~~~~~~~

Enable all tracking options and, optionally, store the analytics in a dedicated database:

.. code-block:: ini

   [ANALYTICS]
   track_workflows=True
   enable_job_accounting=True
   enable_job_progress=True
   enable_workflow_analytics=True
   
   # Optional: specify analytics database
   sqlalchemy_url=postgresql://user:pass@db:5432/analytics

Security Considerations
-----------------------

* Use SSH key authentication instead of passwords
* Restrict SSH access to specific users and commands
* Configure firewall rules to limit network access
* Regularly rotate SSH keys and credentials
* Monitor access logs for suspicious activity

Further Reading
---------------

* :doc:`../developer/containers/biomeroworker` - Worker container details
* :doc:`omero-biomero-admin` - Administrative procedures
* `SLURM Documentation <https://slurm.schedmd.com/documentation.html>`_
* `Singularity User Guide <https://docs.sylabs.io/guides/latest/user-guide/>`_